Hacker News

I thought I knew these things until I worked on a project to scrape arbitrary websites. We had to follow arbitrary links around a site (easy, right?) but it turns out there are many, many edge cases we had to deal with.

Another developer pretty quickly decided to ditch Ruby's URL parser and write our own, since there are tons of things browsers deal with that you wouldn't think of. For example, relative links starting with "//" inherit only the scheme (http or https) from the current page. Add in vagaries specific to some HTTP servers, like treating http://foo/bar and http://foo/bar/ as the same resource, and we quickly realized it was a much bigger task than we thought.
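For what it's worth, the "//" case is standard RFC 3986 reference resolution, and Ruby's stdlib URI does implement it — a quick sketch of the two cases mentioned above (hostnames are made up for illustration):

```ruby
require "uri"

base = URI("https://example.com/articles/page.html")

# A protocol-relative link ("//host/path") keeps only the scheme
# of the current page; host and path come from the link itself.
abs = base.merge("//cdn.example.org/lib.js")
puts abs  # => https://cdn.example.org/lib.js

# An ordinary relative link resolves against the base URL's path.
rel = base.merge("../about")
puts rel  # => https://example.com/about
```

Whether older Rubies got all the browser-observed corner cases right is another question; the path-normalization and trailing-slash quirks are server behavior, not something any parser can decide for you.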

We ultimately got the thing working OK, but crazy edge cases just kept popping up.


