Legal questions on data scraping

meburningslime · January 19, 2025, 7:35am

Hello all,
As of late I have been attempting to bolster my scripting skills. One project I found was data scraping (the spoiler below has info for those who are curious). It piqued my interest and I was very excited to try it out, but I quickly found that there could be legal issues if the site owners disallow it. What is the official stance on this, and would I be allowed to make an application to index all Flowlab games? If so, I would share my results with the community. Thank you all,
meburningslime.

What the honk is a data scraper

A data scraper is a piece of code that systematically searches for information on websites. This is what Google does to check websites for relevance to search terms. When you search “help I accidentally summoned a lemon,” any websites that have the words “help,” “lemon,” etc. will be fed to you. A web spider / web crawler (ever wondered why it was called the web?) will search for links on pages and keep following those links to find new websites and scrape the data off of them. If you have ever found a search result that says “no description available” in the site summary, this is because the website administrator(s) denied access to those data scrapers. Many websites with realtime/sensitive info like Trivago, news outlets, and others use CAPTCHAs, no-robot rules, or simply force you to click a button to view the contents to stop these data scrapers. You can get in real big legal trouble if you scrape websites that don't allow it, so I was just curious to see if I could use Flowlab as a test site for my fun project.
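To make the "search for links and keep following them" step concrete, here is a minimal sketch of the link-finding part using only Python's standard library. The sample page and the `example.com` URLs are made up for illustration, and nothing here touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/games/1" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
SAMPLE_PAGE = """
<html><body>
  <a href="/games/1">First game</a>
  <a href="https://example.com/about">About</a>
  <p>No link here</p>
</body></html>
"""

print(extract_links(SAMPLE_PAGE, "https://example.com/"))
```

A crawler would put the returned links in a queue, fetch each one in turn, and repeat, which is where the rules about what a bot may fetch come in.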

@grazer @JR01 @Samuel_Tomé_PixelPizza

JR01 · January 19, 2025, 9:46am

If the owner doesn't want something scraped, scrapers should already be prevented from accessing that information, probably with login measures. Flowlab is publicly accessible through the internet, and many things from Flowlab and this forum are already searchable on Google via web crawlers.

The only thing I have to say is that because everything is web-hosted, anything loaded on Flowlab like art, music, and formatting can be accessed directly through the browser's developer console. I'd just say: don't steal art from games. Anything account-related shouldn’t be scraped, since that has to be confirmed with the website's account servers, which is also why web crawlers can’t access parts of the web that are locked behind a login.

Another thing: if you do make an application with this data, it should probably list only public games, not unlisted games that can still be accessed with a URL. But if you do scrape Flowlab games, good luck sorting through all the ‘new games’.

grazer · January 19, 2025, 3:18pm

Hey @meburningslime - thanks for asking before tackling this project

In general, I think that anything posted publicly on the website is public information, so there’s nothing wrong with scraping and compiling information you could otherwise get using a web browser for your own purposes.

The downside is that there are lots of bots and web scrapers crawling websites all the time, and if they misbehave, they can easily:

degrade the experience for everyone using the site
trigger large bandwidth bills for me (or you)
generally cause a lot of headaches

To help combat these problems, there are “rules” around using bots on websites, defined by a file called robots.txt (robots.txt - Wikipedia). Here’s the one for Flowlab: https://flowlab.io/robots.txt
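As a rough illustration of how a bot reads those rules, Python's standard library ships a robots.txt parser. The rules below are a made-up sample, not the contents of Flowlab's actual file, and `example-bot` is a hypothetical bot name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Flowlab's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: example-bot
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed here.
parser.parse(SAMPLE_ROBOTS.splitlines())

# The named bot may fetch pages outside /private/ ...
print(parser.can_fetch("example-bot", "https://example.com/games/1"))
# ... but not pages inside it.
print(parser.can_fetch("example-bot", "https://example.com/private/x"))
# A bot without its own section falls under the catch-all "*" rules.
print(parser.can_fetch("other-bot", "https://example.com/games/1"))
# The delay (in seconds) requested for the named bot, if any.
print(parser.crawl_delay("example-bot"))
```

Against a live site, `set_url()` followed by `read()` would download the real file instead of parsing a string.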

So if you do decide to write a web scraper, make sure you do the following:

Choose a distinctive User-Agent, like “meburningslime-bot” or whatever, so that it can be given rules in the robots.txt file
Parse and follow the robots.txt rules for your bot. (Note that many bots are not allowed to access the individual game pages, since they are too aggressive, but Googlebot is allowed there right now.)
Please do not fetch images/videos/audio files (just html and text data)
Please use a reasonable delay between requests (e.g. at least 1 second, or whatever is specified for your bot in the robots.txt), and do not send requests in a tight loop, as this will put an undue load on the server and degrade the experience for everyone.
Please do not attempt to fetch the games past page 100 (you will find that these games are not publicly accessible to web bots like Google, etc.)
Please only fetch public data. For example, don’t make a Flowlab account and have your bot attempt to log in and use that account.
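Put together, the rules above might look something like the sketch below. It is only one possible shape, not an official or tested Flowlab client: the `page_url` callback is a hypothetical stand-in for however the game listing pages are addressed (the thread does not say), and `fetch` and `sleep` are parameters so the loop can be tried offline.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "meburningslime-bot"    # distinctive name, as suggested above
MIN_DELAY_SECONDS = 1.0              # "at least 1 second" between requests
MAX_PAGE = 100                       # do not go past page 100
TEXT_TYPES = ("text/html", "text/plain")


def delay_for(robots, user_agent=USER_AGENT):
    """Use the robots.txt Crawl-delay if it is longer than the minimum."""
    delay = robots.crawl_delay(user_agent)
    if delay is None:
        return MIN_DELAY_SECONDS
    return max(MIN_DELAY_SECONDS, float(delay))


def fetch_text(url):
    """Fetch one URL without logging in; return None for non-text content.

    The content type is only known once the response headers arrive, so
    the main defence against media files is not requesting their URLs.
    """
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        if response.headers.get_content_type() not in TEXT_TYPES:
            return None
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_listing(robots, page_url, fetch=fetch_text, sleep=time.sleep):
    """Fetch listing pages 1..MAX_PAGE, one at a time, with a pause.

    page_url is a hypothetical function mapping a page number to a URL.
    """
    pages = {}
    delay = delay_for(robots)
    for page in range(1, MAX_PAGE + 1):
        url = page_url(page)
        if not robots.can_fetch(USER_AGENT, url):
            continue  # robots.txt says no, so no request is sent
        text = fetch(url)
        if text is not None:
            pages[page] = text
        sleep(delay)  # never send requests in a tight loop
    return pages
```

Before a real run, `robots` would be a `RobotFileParser` that has had `set_url("https://flowlab.io/robots.txt")` and `read()` called on it, so the bot follows whatever rules are currently published there.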

When I find bots that are not well behaved, I ban their IPs and / or Flowlab accounts if they have them. This has happened many times, but it’s usually because the owner of the bot is being a bad actor intentionally, and trying to do something like drive up game play numbers.

If you do want to tackle this project, please DM me on Discord or send me an email and let me know what you are doing before you start running it. That way I can keep an eye on the server side and let you know if the bot is misbehaving, so you can resolve any issues instead of getting your bot banned.