According to the US court systems the robots.txt file is meaningless. If they respond with a 200 status code giving you the access then you can legally scrape it all you want. If they require that you log in then you have to follow the terms you agree to when creating an account. Public means public though, and if Reddit doesn't want to make the content private (put it behind a login) then we can scrape away.
Note that scraping, regardless of the level of permission, doesn't mean you can do anything you want with the content. Copyright still applies. But you can scrape it, and if your use falls under Fair Use or another caveat to the copyright laws then you can do ahead and do it without needing any permission from the authors.
I liked the chapter on DMCA from the 5-volume E-Commerce &
Internet Law. It was super detailed.
I haven’t read volume 1, but apparently half of it is about data scraping, and I expect it to be similarly detailed. So if I were you, that’s where I’d start.
Another option is looking for “robots.txt” at Google Scholar and trying various keywords like “legality”, “scraping”, “case law”, etc.
Note that scraping, regardless of the level of permission, doesn't mean you can do anything you want with the content. Copyright still applies. But you can scrape it, and if your use falls under Fair Use or another caveat to the copyright laws then you can do ahead and do it without needing any permission from the authors.