Hartley Brody is a full-stack web developer and tech lead with 10 years of experience across many modern tech stacks.
Web scraping can be an extremely powerful tool to have in your arsenal as a growth hacker. Hartley teaches us why we should care about it and how we can utilize it.
TOPICS HARTLEY COVERS
- His background in web scraping
- His expertise in SEO and inbound marketing
- What web scraping is
- The difference between web scraping and screen scraping
- How powerful APIs and web scraping are
- How a startup could use it
- His thoughts on CAPTCHA
- His best advice for any startup is to try to grow
- And a whole lot more
READ THE TRANSCRIPTION
Bronson: Welcome to another episode of Growth Hacker TV. I’m Bronson Taylor, and today I have Hartley Brody with us. Hartley, thanks for coming on the program.
Hartley: Great to be here.
Bronson: Yeah, absolutely. Hartley, you began your career on the marketing team at HubSpot, and before that, you built a couple of highly visible music websites. So you know something about growth. In fact, I actually came to you for help. You have a Clarity profile where you do consulting, and Growth Hacker TV came to you. We set up a call and I just picked your brain about the kind of strategy that we were going to use, and we’re systematically putting into practice almost everything you said. So it was very helpful. So thank you for that.
Bronson: I know, it was good. So I recommend other people go check you out and do that as well. That’s actually one of the problems: you have so many talents that it’s hard to know where to take an interview like this. Just to list off some of the things you do: you’re a true growth hacker, you market and you write code. You’re an expert on SEO and inbound marketing, thus the HubSpot stuff. But you also have another talent that we haven’t really talked about on this show, and that’s what I want to dive into. That’s the notorious web scraping. So let’s spend some time talking about web scraping. Does that sound good to you?
Hartley: Yeah, let’s do it.
Bronson: All right, cool. So first, tell everybody: what is web scraping? There’s also the name screen scraping. Are those words interchangeable?
Hartley: So, yeah, there are a lot of different terms people use: screen scraping, data harvesting, stuff like that. I think most of those terms fall under the umbrella of just sort of web scraping. The idea is that it’s essentially a way of pulling information and extracting content from a website, as opposed to an API or RSS feed or something a little bit more structured. The content of websites is generally less structured than some other data feeds, and so having to parse through that takes a little bit more work, a little more creativity sometimes. But to me, the definition that I have most commonly heard for web scraping is just extracting information and data from a website.
Bronson: Gotcha. So it’s kind of like seeing an actual HTML page as a data feed. We usually don’t; we usually just see it as data to be displayed, not a feed to really harness and manipulate and use, whereas we see APIs as, oh, they want us to use this data, they put it in a nice system for us to get at. But you see the whole web as a big data feed, right?
Hartley: That’s right. And a lot of times the reality is, for most websites that people want to extract information from, there isn’t an API, there isn’t a sort of classical API. But by exposing that information via a website, there is essentially a structured document that’s in your face that you can sort of browse through. It takes a little bit of poking around to figure out, okay, what are the content elements I need to look for, and how is this stuff all marked up? But usually things are pretty consistent, pretty straightforward. And once you pull up your inspector and click around a little bit, you’ll find exactly what you’re looking for pretty quickly.
Bronson: Yeah. And we’ll talk about this more in a minute, but I want to hit on it a little bit now. APIs are super structured. HTML documents are still structured, just not as structured, right? So it’s not like there’s no way to see what’s going on. It’s just less structured.
Hartley: That’s right. Exactly. And if you mess around with websites or you build them, or even just look at other people’s websites, you know, right-click and view source, there’s a lot of repetition. People are handcrafting every single div that goes onto their page; they’re saying, oh, we’re going to have some images and we’re going to wrap them all with an image tag and give it a class of thumbnail or something like that. So it’s really easy to go in and say, get me all the image tags that have a class of thumbnail. And that’s essentially the same thing as taking a JSON response and saying, okay, give me all the elements in this list. It’s essentially almost the exact same amount of work. And one of the main benefits of it is that it’s universally available. Any company that has a website has this sort of structured data feed, or loosely structured, but usually fairly structured. And that’s one of the big benefits of web scraping: oftentimes it’s the only option you have for getting data from someplace. But there are a ton of other benefits too. There have definitely been times where companies put out kind of shoddy APIs or they don’t document them very well. And so you’re trying to use it and you’re saying, man, I’m passing all the parameters that it says, but I’m getting back a weird response sometimes. And oftentimes APIs get sort of treated as second-class citizens, where if it kind of degrades over time or things break or start spitting out errors, it doesn’t get fixed right away. Whereas if the company’s website goes down, that’s sort of an all-hands-on-deck, try-to-fix-it situation. So oftentimes a website gets much more TLC than an API does.
And so I’ve definitely found that, in terms of just sort of ease of use, I sometimes find it easier to scrape stuff from a website than to try and wrestle through some poorly written, out-of-date documentation or something that might not even be there. Yeah.
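As a rough illustration of the comparison Hartley draws here, the sketch below pulls the same list of thumbnails once from HTML and once from JSON. The markup, the `thumbnail` class, the image paths, and the JSON shape are all made up for the example, and it uses only Python's standard-library `html.parser` rather than any particular scraping package.

```python
import json
from html.parser import HTMLParser

# Hypothetical page markup and a hypothetical API response carrying the same data.
SAMPLE_HTML = """
<div class="gallery">
  <img class="thumbnail" src="/img/a.jpg">
  <img class="thumbnail featured" src="/img/b.jpg">
  <img class="logo" src="/img/logo.png">
</div>
"""
SAMPLE_JSON = '{"thumbnails": ["/img/a.jpg", "/img/b.jpg"]}'


class ThumbnailCollector(HTMLParser):
    """Collect the src of every <img> whose class list includes 'thumbnail'."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # An element can carry several classes, so split the attribute value.
        classes = (attr_map.get("class") or "").split()
        if tag == "img" and "thumbnail" in classes:
            self.sources.append(attr_map.get("src"))


def thumbnails_from_html(html):
    collector = ThumbnailCollector()
    collector.feed(html)
    return collector.sources


def thumbnails_from_json(payload):
    # With a structured feed, the same lookup is a single key access.
    return json.loads(payload)["thumbnails"]


if __name__ == "__main__":
    print(thumbnails_from_html(SAMPLE_HTML))
    print(thumbnails_from_json(SAMPLE_JSON))
```

Both functions return the same two paths for this sample; the HTML version just needs a few more lines to say which tags and classes to look for.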
Bronson: And APIs, you know, they’re not super easy to create. They take time, they take resources. So if you’re trying to do something with a smaller company, scraping might be the only option you have. And one thing you said: it’s really about repeatability as well. It’s about finding the patterns in the code and then using those patterns to do what you want to do. Right?
Hartley: Exactly. And I think the other thing you mentioned is that it is extra work to build an API. Sometimes you read about the architecture of Netflix or how things work at Facebook, and internally they use a lot of APIs themselves. You hear the concept of a service-oriented architecture, where a company has a bunch of internal APIs they use. That usually isn’t V1, that usually isn’t the first version that a company builds. That’s usually, okay, we’ve got to scale this, we’ve got to grow this, we’ve got dozens or hundreds of engineers, let’s start to compartmentalize a little bit. If you’re working on a site that may be a mom-and-pop shop or a startup or something like that, very often they don’t have an API out of the box. They’re using something hacked together; maybe they’re using Rails or something like that, or Django. But very often the API is not the first concern at a smaller company, and a lot of the data that you need might be coming from a company that just doesn’t have the resources to put behind building an API, essentially.
Bronson: Yeah. And you talked about a company using APIs internally to share their own data. I could be wrong here, but maybe you’ve heard of this: I think Jeff Bezos at one point told Amazon, you’re not allowed to share data in the company unless it’s in an API format, because he was really going for this compartmentalized efficiency. Have you heard that?
Hartley: Yeah. So there’s this whole sort of school of thought that’s pretty big in the enterprise around service-oriented architectures, where basically you have stuff that’s really siloed. And Amazon, I think, is one of the sort of people at the forefront of this. You have some service, let’s say it’s your book recommendation algorithm, and you have some other service that’s just the data warehouse for all the thumbnails. Those are two logically separate things, and they need to have sort of a concise, uniform, standard interface for accessing them. And that basically is an API. It’s exactly what an API is: a standardized interface for accessing data. So say I’m working on a thumbnail team at Amazon and I need to call over to this thing. Rather than having it be a function call that only I have special access to, I ping an engineer over there and I say, hey, could you add this thing to your API? And now I can use it, but now anyone else in the organization can also use it, and third parties, people outside your organization, can also use it. So it’s a good way to structure teams and structure the sort of data pipeline stuff of a business. But it does take a little bit more upfront work to figure out what your services are, how they’re going to act, and what those interfaces look like. So usually you don’t see it too much at smaller companies, but definitely at bigger ones.
Bronson: No, for sure. And now, you know, some people watching this or listening to this know how powerful APIs and web scraping are. They just get it. They understand what data can do and what data at scale can do. And some people are like, why are we talking about APIs on Growth Hacker TV? Why are we talking about web scraping on Growth Hacker TV? So connect the dots for us. Give us some meat-and-potatoes, real-world examples, either possibilities or maybe things that actually happened. Tell us why or how a startup could use this kind of thinking.
Hartley: Yeah, so because of the blog post, which I think will probably be in the link at the bottom, and the book that I’ve written, I get a lot of inbound requests. People have different questions about web scraping or want to know if I can help with this and that, and I’ve picked up some really interesting freelance jobs over the past few months. This one guy was basically running a business around unclaimed property: if you have unclaimed property, after a certain amount of time it gets forfeited to the government. He was running a business out of his home that contacted people and said, hey, did you realize that you have unclaimed property? We can help you file for it and do all the paperwork. And he takes a cut of that for doing that service. But he had a problem. He said, okay, I’ve got this set up and I’ve been sort of manually going through their site, or I wait for them to give me a data dump like once a year, but they’re posting new stuff all the time, and all my competitors are waiting for this annual data dump. It’d be great if I could really get a jump on it and contact these people first. I won’t give out exact numbers, but there were lots and lots of zeros on the number of people in the state of California that have unclaimed property. So for him, this is tons and tons of leads for his business that he wouldn’t have had access to otherwise, people that could really benefit from his services. They have literally cash sitting around that’s theirs that they don’t know about, and he’s trying to get in touch with them. And it’s so many people that there’s just really no way for him to do it himself. And he doesn’t want to wait for the government to release this data dump. There are usually holes in it, it takes them forever to do it, and some years they forget. He was like, look, the data is there on the website.
I’m not going to go through all these rows by hand. There are millions of them. But you know what? If I could just automate it. And so it was pretty straightforward. With scraping, we just said, hey, these pages have a pretty consistent format. Some of them are a little different depending on what kind of property it was; the classes and IDs and the tables might change a little bit. We went through one or two iterations and I found, oh, sometimes they split the address onto two or three lines, sometimes they put it on one line. So you sort of massage the data a little bit. Okay, now I’ve got it in a format that’s pretty consistent, and I can just dump it into a CSV and send it to him, and he can put it in his mailing system. And he’s gotten tons and tons of business from this. So that’s probably my favorite example of getting to bring lots of new leads and customers into a business, basically because we were able to automatically harvest this information from a website when there really was no other way to get it.
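The clean-up step described here, where an address might arrive on one, two, or three lines and everything ends up in a CSV, could look roughly like the sketch below. The record fields (`name`, `address_lines`) and the sample rows are invented for illustration; the actual site's columns are not given in the interview.

```python
import csv
import io


def normalize_address(lines):
    """Join an address that may span one, two, or three lines into one string."""
    # Collapse stray whitespace inside each line, then drop empty lines.
    parts = [" ".join(line.split()) for line in lines]
    return ", ".join(part for part in parts if part)


def rows_to_csv(records):
    """Serialize scraped records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "address"])
    for record in records:
        writer.writerow([record["name"], normalize_address(record["address_lines"])])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical scraped rows: one address split over three lines, one on a single line.
    sample = [
        {"name": "Jane Doe", "address_lines": ["123 Main St", "Apt 4", "Sacramento, CA"]},
        {"name": "John Roe", "address_lines": ["456 Oak Ave,   Fresno, CA"]},
    ]
    print(rows_to_csv(sample))
```

The `csv` module quotes fields that contain commas, so the joined address survives a round trip into a spreadsheet or mailing tool as a single column.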
Bronson: Yeah. And so lead generation is huge for this kind of stuff. That’s one application for sure, and that’s an important one because that’s where a lot of people make money. So let me ask you this. That’s an example where obviously he’s going to make a lot of money by doing it that way. In your experience, is web scraping a high-growth channel kind of thing, or is it a fringe thing you can do but it’s going to be marginal growth? Where does it fall, and is that even an accurate question to ask?
Hartley: It’s tough. Here’s what I’d say: web scraping is a tool, just like a sledgehammer is a tool. Right? Sometimes, if you’re building some kinds of buildings, a sledgehammer might be really useful because you might have to knock out a bunch of old stuff or something like that. But if you’re building a tree house, you might not need a sledgehammer. So it depends on what you’re trying to do, essentially. For lead generation, it can be super useful, super valuable. For a company that’s more B2C, I have somewhat of a tougher time imagining it being used as a growth tactic. The other place I’ve seen it used is, I think Feedly does a bit of it. They don’t use it for acquisition as far as I can tell, but when they are showing me content, it’s content that they’ve pulled from somewhere. Most of the time I think they’re pulling it from an RSS feed, which is sort of halfway between a website and an API; it’s a little bit more structured than a website, but without those different sorts of specifications and things like that. But I think they also pull content from sites, and certainly those read-it-later sites where you put in a URL and it goes and pulls out some of the content and ignores the ads and the sidebar and things like that, that’s all web scraping. It’s not being used for user acquisition; it’s being used for data acquisition. But yeah, it’s a tool to have in your arsenal that makes sense for some businesses, and for others it might not fit as well in their strategy.
Bronson: But yeah, I like those examples because now it’s a whole different category of things it could be used for. So if I’m on Safari, on some websites it tells me, oh, this data is structured well enough. If I hit the little Reader button in the URL bar, it takes away all the images, or at least the ones I don’t want to see, and it just shows me nicely formatted text that’s very readable with nothing else. But that’s not on the person’s website. That’s Safari screen scraping, right?
Hartley: Yeah, basically. And there are actually a bunch of different algorithms. There’s sort of a friendly competition between a bunch of different groups that are working on algorithms that basically say, give me the messiest HTML document that you can, and I will try and pick out only the important parts and throw away all the other stuff. And it’s hard. Depending on how heinous a site’s markup is, it can be really hard for a program to tell what’s the main content and what isn’t. Most of the scraping that I’ve done has been around a discrete number of pages, where I can say, I know exactly how each of these pages is going to be structured. For example, in the unclaimed property records, they all have the same table. There were a few differences in the rows and columns, but it wasn’t like I was scraping that site and then 100 other sites that all use totally different stuff, where some of them might use different tables and things like that. To do things like that, you have to take more guesses, and it’s not as exact in terms of where you can actually get the content that you want. So there are these algorithms that have been proven out over time, that have been worked on by a lot of different people, that sort of say, okay, how can we figure out where the content is? Let’s look for some paragraph tags. Are there P tags on the page? Okay, let’s decide whether this feels like it’s a body element. And so they get really complicated, and to do scraping across dozens of different domains is usually pretty tricky. Most of the web scraping I do is like, you have a site that’s basically your target, and you know exactly what’s there. You can look at it, you click around, you know it’s going to be fairly consistent, and then you just run it. And if there’s an exception or an error, you say, oh, what is this weird case? Oh, I see.
There was one time where I was pulling a bunch of data from some fashion retailer sites. I think it might have been Nordstrom’s or something like that. They had a description of the item underneath the price and the add-to-cart button and everything like that. And there were only two or three products in the entire catalog where, instead of it just being a paragraph tag, it was a bulleted list of things. The first time I hit it, it was like, well, what is this? This isn’t just text. There’s other stuff in here. But then you hit the error and you’re like, oh, I guess this is a weird edge case I need to design for. So I think the majority of web scraping is tailored to a specific site. You have something in mind and you can hone in on what the content you want is. But there is also this other school of thought around doing it more algorithmically, where you say, okay, we might have to pull content from hundreds or thousands of sites. There’s no way that we’re going to be able to come up with all the different hooks that we need to find the content. We have to do it in a more automated fashion. And that definitely gets a lot trickier to do.
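One way to absorb the edge case from this story, where a product description is usually a paragraph but occasionally a bulleted list, is to accept either shape and return a list of text blocks. This is only a sketch under assumptions: it presumes the description fragment has already been isolated from the page, and the sample markup is invented rather than taken from any retailer.

```python
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect the text of each <p> or <li> element in an HTML fragment."""

    BLOCK_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None  # text pieces of the block being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._current is not None:
            # Normalize whitespace and skip blocks that turn out to be empty.
            text = " ".join("".join(self._current).split())
            if text:
                self.blocks.append(text)
            self._current = None


def extract_description(fragment):
    """Return description text whether the markup uses <p> or a bulleted list."""
    parser = TextBlocks()
    parser.feed(fragment)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    print(extract_description("<p>Soft wool sweater.</p>"))
    print(extract_description("<ul><li>100% wool</li><li>Dry clean only</li></ul>"))
```

The common case and the rare case then flow through the same code path, so the scraper does not need to stop on the handful of products that use a list.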
Bronson: Well, yeah, that’s a great point. I never really thought about it. I guess I know it, because the Safari example I gave is the broad, general algorithm: you can throw almost every site at it and it can do it. But the unclaimed property example that you gave is a specific one. So I’ve seen both and I understand both, but I never really thought, oh, those are two totally different ways of approaching it. What you do with the one is not the way you would do the other. It’s just a different kind of thing.
Hartley: Right, exactly. It’s the same high-level goal of pulling content from a site. It’s just a question of how much you know ahead of time. If you know exactly what the site’s going to look like with reasonable certainty, then you can hone in on exactly the data that you want. And it’s basically just like saying, oh, I have this site’s API, I know exactly how it’s going to come back, what structure; I’m just going to load it as a dictionary or a hash table and pluck out these keys. But sometimes you run into a thing where you’re like, I don’t really know what this JSON is going to look like, and you have to use different sorts of tactics. But it’s the same high-level goal: just getting the content from a site.
Bronson: Like with the general algorithm, you’re just building in a lot of extra logic. If this happens, do this. If that doesn’t happen, try this. If these both happen, make sure you do that. You have to, because there are so many edge cases; almost everything is an edge case now.
Hartley: Right, exactly. And honestly, like I said, most of the stuff that I’ve seen is the more straightforward kind. Someone has maybe two or three sites, maybe one site, maybe a dozen sites, and you just go through them one at a time and you say, okay, here’s how this site does it, here’s how that site does it. That way you know exactly what you’re getting, and you know it’s going to work with reasonable certainty. Because a lot of times if you take a random web page and run it through one of these give-me-a-summary algorithms, like with Safari, sometimes it might cut off a bottom paragraph because it thinks it’s part of an ad, or it might include the sidebar when it really shouldn’t. And you’re like, this doesn’t make any sense, why is this in here? So it doesn’t always get it right. But if you’re willing to sacrifice a little bit of the quality, the tradeoff is that you get more flexibility with your scraper, and then that’s definitely a way to go. In most of the applications that I’ve used web scraping for, they needed the data exactly the way that they wanted it. So they wanted to go in by hand and pick out all of the divs and spans and stuff like that, the content information that...
Bronson: Usually you come to web scraping with a goal in mind: I want this kind of data from this exact site so that I can do this thing with it. It’s not, I want to web scrape for fun and profit. There’s a goal in mind. Let me ask you this. Walk me through the process, what you do when a client says, hey, I want you to screen scrape some stuff for me. Is the first step to just go to the website, inspect element, and see if the website even has the data that you want? Is that step one?
Hartley: Yeah. I mean, step one is basically saying, okay, I’m using a website as a data feed. In the same way that you might say, okay, I’m looking for information, where do I find it in the site’s API? If you’re deciding to go the web scraping route, I would say, where do I find this information on their website?
Bronson: So that’s the API docs. You’re looking through their source code?
Hartley: Yeah, it’s it’s basically like their website is a combination of the data you want.
Bronson: And anyways.
Hartley: Yeah. Well, hopefully. I mean, if it’s not, then web scraping might not work for you, if you can’t get it from their site. But the nice thing about a website is it’s basically like the documentation and the API all in one. I can look around and I’m like, oh yeah, I’m looking at all menswear at J.Crew. Oh, I just want sweaters. I can click right there. I don’t have to go read something about collections versus categories versus... I think it just makes that...
Bronson: You’re in the code looking at how they title.
Bronson: Just text and it would.
Hartley: Send a request to, like, /timeline.json. They sent some information about a cookie that said who was logged in and maybe what the time offset was. That went over, and Twitter did all its work with their database and all their services. And what did the response come back as? It wasn’t anything crazy. It was just some more text that came back. You can read it and you can parse it. And once you make that realization, all the rest of those complexities kind of fall away. You say, okay, at some level, I just need to build a string, send that string over the wire, get a string back, and parse it the right way.
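To make the "it's all just text" point concrete, here is a minimal sketch that assembles the text of an HTTP/1.1 GET request and splits a raw response back apart. The host name, cookie value, and response body are placeholders, and the parser deliberately ignores real-world details such as chunked transfer encoding and repeated headers.

```python
def build_request(host, path, headers=None):
    """Assemble the plain text of an HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Header lines end with CRLF, and a blank line closes the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line looks like: HTTP/1.1 200 OK
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


if __name__ == "__main__":
    # Placeholder host and cookie; nothing is sent over the network here.
    print(build_request("example.com", "/timeline.json", {"Cookie": "session=abc123"}))
    raw = 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{"tweets": []}'
    print(parse_response(raw))
```

In practice an HTTP library builds and parses these strings for you; seeing them spelled out mainly helps when a scraper's request behaves differently from the browser's.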
Bronson: Whether it went into that.
Hartley: Yeah, sorry.
Bronson: I mean, no, that’s awesome. I mean, you know, you think about the Internet, it’s been polished over the years. It’s never been overhauled. Like it’s still just text the way it was in the early nineties. Just text.
Hartley: Well, the beauty of it is that’s all you really need. It’s simple, and it works very well, and here we are decades later, building these complex applications on it. But it’s still the exact same building blocks that have been used since way back when it was just a few universities connected over some, you know, fiber optic cables across...
Bronson: No, I think you’re right. We get thrown off because it looks so polished and so complicated, and we’ve got Ajax going on and all this stuff. But at the root, machine code is just ones and zeros. Even images: we think of images as not text. No, it’s just ones and zeros.
Hartley: And graphic images.
Bronson: Yeah, it’s just code.
Bronson: Yeah, no, it’s great. Everything you just said totally demystifies it, which is what we want. So that’s awesome. So you went to the website, you viewed the source code, and they have the data that you want. Is that when you start writing code, or is there any other step after that that we’re missing?
Hartley: So usually I want to confirm that I can make programmatic requests and see the same stuff that I saw in my browser. At first I used to not do that. I used to be like, oh, I saw it in my browser, let me jump right into coding. And I’d get hung up on these errors where I’d spend hours and hours trying to parse the response, and I was like, why doesn’t it have this div that I thought it was going to have? And then I discovered that, oh, this site required there to be this weird header set, and my browser was sending it, but I wasn’t sending it when I was doing it programmatically. So the response I was getting back was just an error message instead of the actual content of the page. So I use a tool, I think it’s called REST Client or something like that. You can search for a bunch of them. It’s basically a graphical user interface on top of cURL, which is sort of the command-line version of making network requests. It lets you basically say, here’s a URL, here are some headers, send this request over the network and just give me all of the raw information that comes back.
Bronson: So instead of actually writing the code, putting it in a file, and executing it, you can just drop it in this thing and see what it pulls back.
Hartley: Yeah. And actually, I think the thing I use is called Postman. It’s a Chrome extension, essentially; I think now it’s a Chrome app. It lets you save things and it gives you a really nice, easy interface. You can say, oh, I want to send a POST request or a GET request or these other kinds. And you can set POST data and things like that, form data, and it automatically takes care of it. Again, at some level it’s just text going over the wire, but it gives you this interface that lets you click and select things, and then it turns that into the text and sends it over the wire for you. And you can say, oh, interesting, the response I’m getting back from this site is different than the one that I saw when I was loading it in my browser. Let me check. Oh, looks like when I was doing it in Chrome, Chrome was sending this other header that I forgot to send. Let me add that. Okay, now it’s working.
Bronson: So I suggest.
Hartley: Yeah, basically you go from having it work when you just click on things in your browser. The next step is, okay, can I reliably build my own HTTP request by hand and get the same response? And then once I’ve said, okay, I know what I need to get this HTTP response, then you can dive into the code and say, all right, let’s start scripting some of this stuff and writing it down in code so that I can automate these requests and stuff like that.
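Once the request works by hand, the same headers can be carried into a script. The sketch below uses Python's standard-library `urllib.request`; the header names and values are stand-ins for whatever the browser was actually sending (copied from the inspector's network tab), and `X-Requested-With` is only a guess at the kind of "weird header" a site might require.

```python
import urllib.request

# Assumed values: replace with the headers your own browser sends to the target site.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    "Accept": "text/html",
    "X-Requested-With": "XMLHttpRequest",  # hypothetical header some sites check for
}


def make_browser_like_request(url, headers=BROWSER_HEADERS):
    """Build a GET request that carries the same headers the browser sent."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch(url, headers=BROWSER_HEADERS, timeout=10):
    """Send the request and return (status, body_text) for comparison with the browser."""
    request = make_browser_like_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    request = make_browser_like_request("https://example.com/")
    # Inspect what would go over the wire before actually sending anything.
    print(request.get_method(), request.full_url)
    print(request.header_items())
```

If the body returned by `fetch` differs from what the browser shows, comparing `header_items()` against the browser's request headers is usually the first thing to check.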
Bronson: So then you start writing code, and I know in your book you actually have code samples, so people should definitely go check out the book and buy it. I think it’s about ten bucks, right? Yeah. I mean, it’s nothing. So go buy it. You know, there’s someone in California making a lot more than $10 off this stuff, off unclaimed property. So you start writing code. Without getting into literally what the lines of code say, what is the code doing? Is it just saying, hey, look for this header tag and then pull out the data between the beginning and the end of the H1? Is that the kind of thing it’s doing, looking for certain tags and pulling stuff out?
Hartley: And actually, usually the first thing it does is something else, because if you only have one web page, you’re probably not using a scraper; you can just go copy and paste it yourself. Usually when you’re using a scraper there’s a ton of pages. So usually the first job of a scraper is to be able to enumerate all the URLs. And that can be tricky. Sometimes things are super well formatted and you have a site where it’s like /product/id and then there’s a number, and you can just increment the number by one and you’ve got the next product, and it’s super nice like that. Sometimes it’s not that straightforward and you have to actually do what I would say gets into the realm of web crawling or web spidering, where basically you’re doing the same thing, except you don’t know what all your URLs are going to be ahead of time, or it’s not easy to discover them. A good example of this might be if you’re trying to get products on Amazon, where most of the products have some text in the URL that’s like the title of the product or some kind of description. Let’s say you’re trying to get all the light bulbs on Amazon. You don’t really know the URLs until you actually view a page with all the light bulbs on it. You say, okay, here are the light bulbs. Now let me go through each of those links and collect them. Now I’m going to go scrape those pages, and you have to keep track of what pages you’ve already scraped. So spidering gets a little bit more complicated.
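For the easy case described here, where product pages differ only by a number at the end, enumerating the URLs can be as small as the sketch below. The base URL and the `/product/id/` path layout are assumptions made for the example.

```python
def product_urls(base_url, first_id, last_id):
    """Yield one URL per product ID, from first_id through last_id inclusive."""
    for product_id in range(first_id, last_id + 1):
        yield f"{base_url}/product/id/{product_id}"


if __name__ == "__main__":
    # Hypothetical shop; a real scraper would fetch each URL in turn.
    for url in product_urls("https://shop.example.com", 1, 3):
        print(url)
```

Using a generator means the full list never has to sit in memory, which matters once the ID range runs into the millions.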
Bronson: Exponentially so.
Hartley: Yeah. And especially you have to worry about not visiting the same URLs over and over again and getting into infinite loops. I did that one time: I wrote a super, super bad web spider that was basically, okay, find a list of every link on the page, then go visit the first link, and then do that over and over again recursively. The first link...
Bronson: Was the page, wasn’t it?
Hartley: And basically it would just go between these two pages back and forth and it never got to the other ones. So you have to think it through, and it’s kind of a classic computer science problem: do you use recursion or looping, and how do you keep track of state? It gets especially complicated nowadays if you’re scraping hundreds of thousands or millions of pages and you have a distributed crawler, where a bunch of different machines are making the requests and you have to have a central place that keeps the state of where you are, what pages you’ve hit, and what’s next up. And it gets even more complicated if you say, I don’t want to slam this site and send too many requests all at once, so I’ve got to keep track of the last time that I hit a URL on the site, and then whenever I’m about to hit the next one, check how recently I did. So it can get pretty complicated pretty quickly. But usually the first thing that I’m doing is just trying to get to the next page, or enumerate the URLs that you need to scrape. Each application is different, so it’s hard to have a one-size-fits-all answer for that. But at a high level, once you’re enumerating all the URLs, then it’s just sending the request to each of the URLs you need, and then saying, great, we got some text back from this HTTP request, and we know where the things we need are in here, the title or the description and stuff like that. I do most of mine in Python. I have used Beautiful Soup a lot. One of my coworkers is trying to convince me that lxml’s etree is better. There’s sort of a holy war going on in terms of which is better at parsing HTML. I have been learning it. I find Beautiful Soup a little bit easier, but those are two great Python packages. There’s Nokogiri in Ruby.
You want to stay away from XML packages. Sometimes you hear people say, oh yeah, we can read XML documents. Super quick detour: XML is a specification of how text should look. It’s the same thing used in RSS feeds and some APIs, and it has certain rules about nesting and tags being closed, and there are namespaces and stuff like that. Most HTML you encounter on the web is not XML-compliant. It looks similar to XML. If you look at it you think, oh, this looks like XML, but most HTML pages are not valid XML. And so if you say, oh, I’ve got this giant string of text and it’s got pointy brackets, let me put it in my XML parser, it’s going to say, this is not valid. So HTML parsing is a little bit more tricky, because it’s more free-form and people don’t always nest things properly. They forget to close tags and stuff like that, so it can be ambiguous sometimes what the markup of a site actually looks like. You can imagine if I have a paragraph tag and I open that, and then inside I have an italics tag, or let’s say an emphasis tag, and I have some text, and then I close the paragraph tag, slash p, and then I close the emphasis tag, slash...
Bronson: Yeah, they’re out of order now.
Hartley: It’s ambiguous. It’s like, well, is the paragraph inside the emphasis tag, or does the emphasis have to stay inside the paragraph tag? And this stuff is all over the web. Most websites are not actually valid, even as HTML. If you put them in an HTML validator service, they usually don’t pass. So the libraries that you use to parse this text and give it some structure have to do a bit of guesswork to eliminate some of that ambiguity.
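The misnested paragraph and emphasis example can be tried directly. The sketch below uses only Python’s standard library rather than Beautiful Soup or lxml, which is a substitution made here to keep the example self-contained: `xml.etree.ElementTree` plays the strict XML parser, and `html.parser.HTMLParser` plays the lenient one. The helper names `is_valid_xml` and `extract_text` are made up for illustration. Note that `HTMLParser` only tokenizes the markup; libraries such as Beautiful Soup and lxml go further and build a repaired tree.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The tags are closed in the wrong order, as in the example from the interview.
MISNESTED = "<p>Some <em>emphasized text</p></em>"


def is_valid_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextExtractor(HTMLParser):
    """Lenient HTML tokenizer that keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(is_valid_xml(MISNESTED))   # the strict parser rejects the misnested tags
print(extract_text(MISNESTED))   # the lenient parser still recovers the text
```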
Bronson: So they have to be pretty powerful, not just XML parsers.
Hartley: Right. Yeah. And some of them are more full-featured, and they actually do both now. lxml, the one I was mentioning earlier that I’m learning in Python, has a separate HTML parsing module. They also have the XML one. And actually, the lxml stuff is usually pretty fast. It’s really targeted at performance. But I usually don’t worry about the performance of my scraping code, and here’s why. It’s important to keep things in perspective and keep an eye on orders of magnitude of latency and how long things are taking in your application. I saw a Y Combinator-backed company, I won’t say who they were, put out a blog post. They do a lot of web scraping, and they were moving from one parsing library to another, and the reason was that the old one wasn’t fast enough. Unless they’re on something like a Google Fiber network and they’re super close to the server, most of your slowness is going to be the network. It’s going to take two or three orders of magnitude more time for the request to go over the network, get to the faraway server, and come back to your computer, so if you’re squabbling over one or two CPU cycles here and there to do faster XML processing, you’re probably focused on the wrong thing.
Bronson: I was just saying, it seems like the target would have to be huge for scraping speed to matter. Like your guy in California: there’s a very finite number of listings every day, every year. It doesn’t matter if you speed it up. It’s done pretty quick.
Hartley: Exactly. Exactly.
Bronson: And that’s what a lot of scraping really is, I think, the kind that people watching this show are actually doing day to day. They’re not scraping Google.
Hartley: Yeah, that’s the thing: there’s usually a finite amount of stuff. Most of the scraping jobs I have, they run overnight. I say, you know what, this may take 4 hours, it might take 6 hours, I don’t really know. I’m going to set it up and make sure it works, and then when I wake up in the morning, I’m just going to check on it and make sure it’s still running. It’s definitely not a real-time system. I would never advocate for building, like, a news feed where when someone logs in, you go out and do some web scraping. That’s a horrible idea. You only use it for offline data purposes. Sometimes you might do some scraping, pull the results of it into a database, and then build a website running on top of that database to make searching it fast. But there are so many problems. What if the site you’re scraping goes down, or changes their URLs, or changes their markup and something breaks? You never want to have live scraping happening in front of a user. So all the jobs I’ve ever done are that sort of offline, batch kind of stuff where I let them run overnight. They take four to six to eight hours, sometimes even a few days. And then you’re not really that worried about performance at the CPU level. You might be worried about it more at the high level, where you’re saying, okay, how can I reduce the number of URLs I scrape? Let me do some de-duping or something like that. Like, oh, I have the same URL twice and I’ve already hit it. That’s a great performance improvement, because now you’re not hitting the same URL over and over again and wasting time doing that.
Talking about which parser is faster for your HTML, the gains you’re going to realize are going to be so infinitesimally small compared to the total time your application takes that you’re probably not going to notice them.
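The de-duping step mentioned above can be sketched in a few lines of Python. The normalization rules used here (lowercasing the scheme and host, dropping the `#fragment`, treating an empty path as `/`) are assumptions chosen for illustration; the right rules depend on the site, since some sites treat query strings or trailing slashes as significant. The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"  # "http://host" and "http://host/" are the same page
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(urls):
    """Return normalized URLs in first-seen order, skipping repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


to_scrape = dedupe([
    "http://Example.test/a#top",
    "http://example.test/a",
    "http://example.test",
])
```

Each URL removed this way saves a full network round trip, which is where most of a scraper’s time goes.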
Bronson: So once you’ve actually scraped the data, you’ve sent the request, you’ve gotten back the data that you want, and your code has found what it needs, then you store it in a database? Is that the next step, or...?
Hartley: Yeah, I mean, it depends what the reasons are for it. A lot of the web scraping clients that I do work for are non-technical; they don’t know SQL or anything like that. So when I’m doing the scraping, I put it in a database, but then I have a script that reads out of the database and writes it to a CSV. And then I just send them the CSV and they can pull it into Excel or whatever. It depends on what they need it for, what they’re using it for, and who the end user of that data is. When it’s a non-technical person, I find CSV is usually the easiest, and it’s easy to convert that to Excel or something.
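A database-to-CSV export script of the kind described here might look roughly like the following sketch. It assumes SQLite as the database and a hypothetical `listings` table with `title` and `price` columns; the interview does not say which database or schema was actually used.

```python
import csv
import io
import sqlite3


def export_to_csv(conn, query, out):
    """Write the rows returned by `query` to the file-like object `out`,
    with the column names as a header row."""
    cursor = conn.execute(query)
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# In-memory database standing in for the one the scraper filled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [("Lamp", 19.99), ("Desk, oak", 120.0)],
)

buffer = io.StringIO()
export_to_csv(conn, "SELECT title, price FROM listings ORDER BY title", buffer)
csv_text = buffer.getvalue()
```

To write an actual file for a client, pass `open("listings.csv", "w", newline="")` instead of the `StringIO` buffer. The `csv` module quotes values that contain commas, so the result opens cleanly in Excel.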
Bronson: Yeah, they can import that into whatever customer relationship stuff they have. What are they going to do with it, or are they going to open it up and do it manually?
Hartley: Whatever is out now.
Bronson: Web scraping can sometimes get a bad name. Is it legitimate to do this? Like, is it legal? Is there anything we should be worried about? I mean, that’s a question that has to come up a lot, right?
Hartley: Yeah. It’s definitely something that comes up when people hear web scraping. The word scraping kind of has negative connotations, just in the English language in general. It sort of implies vandalism or something like that. I think, to go back to the metaphor earlier with the sledgehammer, it’s a tool, and you can use it to knock down something on your own property when you need to, or you can use it to go break into your neighbor’s house. One of those is legal. One of those isn’t. So it really just depends. I think it’s unfair, and it’s actually impossible, to say at a high level whether scraping is uniformly legal or uniformly illegal. I would never say that it’s always legal, but I also would say that it’s not always illegal. It just depends on what you’re doing with the data and how you access it. One of the more famous, or infamous, depending on your stance, web scraping cases recently is weev. He was on trial for a while. I don’t like him, but he basically discovered that on the AT&T website, I forget exactly what it was, it might have been an email address or just a user ID, if you went to like AT&T dot com slash iPad slash user slash and you put in a user ID, it would just spit back all this information about who that user was. And if you just incremented the number by one, then you got the next one, and the next one. And he wrote a script that harvested hundreds of thousands or even millions of these things. And he didn’t really practice responsible disclosure. He kind of tried to blackmail AT&T a little bit with it, I think. I don’t know all the details of the case, but he’s, I think, sitting in a jail cell right now. So he didn’t come out too well from that whole thing. But I think it was mostly because of his antagonistic tendencies.
But I mean, certainly if you discovered something like that and wrote a script and contacted the site and said, hey, here’s a proof of concept I just did, here’s a security vulnerability, I’m not going to release any information about this to the public, I’m going to give you guys a chance to fix it, there’s definitely ethical scraping like that. And a lot of times sites just don’t care. They know that they’re being scraped, they don’t feel like having an API, they know that people really, really want their data, and they’re just like, whatever. We’re not going to stop them, we’re not going to do anything. Just kind of have at it. And they’re obviously not going to get mad. Another case I heard about was Craigslist and PadMapper. PadMapper’s whole site was basically a Google Map overlaid with Craigslist and other sorts of apartment data. And Craigslist turned around and said, you know what, actually, we don’t want you scraping our data because we have an API. So PadMapper was scraping the data, Craigslist said we don’t want you doing that anymore, PadMapper was like, okay, and Craigslist blocked their IP address from scraping. But then PadMapper went ahead and said, you know what, actually we’re going to generate a new IP address and keep doing it anyway. And once they did that, I think there’s a court case still going on where a judge has said, well, okay, now you were actively circumventing this site trying to put up a wall to prevent your behavior, and that constitutes breach of a computer system or something like that. These systems are so new that there hasn’t been a ton of case law yet. So it’s hard to say specifically what is and isn’t legal yet. But there have been a few high-profile cases that I’ve been watching to see how things come out.
And honestly, a lot of it comes down to very, very specific facts of the case. So it’s really hard to say whether, as a tool, it’s always a good thing or always a bad thing. It really just depends. If the site’s trying to throw up blockers and you’re trying to actively get around those, well, that’s not good. Now you’re opening yourself up a little bit.
Bronson: Yeah, well, that’s a good point. When it comes to security and the law, it’s not, how good is the security and can I circumvent it easily? It’s, did I try to circumvent it at all? A lot of times it’s not, oh, the password was in the HTML. The fact they had a password on it means you did something to get through, you know?
Hartley: Exactly. And I mean, in the weev case, who’s to say that he didn’t just find those links in Google? He searched for AT&T iPad, just clicked on them, and said, hey, here’s a valid link. The fact that he wrote a script and enumerated them, and that he had a sort of hacker persona, I think all those intangibles counted against him. The technology of saying, here’s a URL, and if I visit it a lot, something weird happens, it’s ambiguous whether or not that constitutes a breach of the law. It’s more the higher-level question: is the site actively trying to keep me out while I’m trying to get around it? Is there some sort of security system that they put in place? Or is their technology weak and I’m saying, wow, look at these guys, they left a giant hole in their fence? This isn’t web scraping related, but I don’t know if you heard about this. There was a big bug on, I think, United’s website a few months ago where if you pulled up two tabs, and you searched for a flight in this one and searched for a flight in that one, and then you closed them in a specific order and refreshed one, all of a sudden all your flights were free, and people were booking free flights. There was some bug in their booking system. And apparently it happened once before and they said, you know what, we actually will honor all these things. We think it’s great that people are finding these bugs, we feel bad about it, it’s all our fault, and we’ll give it to you. Then it happened again. I think Mashable wrote about it and a bunch of other people covered it, and that’s when they put their foot down. They said, no, no, no, we’re not going to honor these. You guys clearly had found a bug in our thing.
People were clearly giving instructions to each other about how to get the exact sequence of clicks to make it happen. And they said, no, this is obviously you guys going out of your way to sneak through a hole in our fence. We’re not going to honor this. And I think they did a good job handling it, both in the first case, when it was sort of their fault and it didn’t get out of hand and they dealt with it quickly, and then in the second case, when people were really exploiting it and it was going to cost them a lot of money. It’s like saying, oh, they put a fence around the business, but I was able to climb over it, so I should be able to just steal whatever I want, right? It’s like, well, that’s not really how it works.
Bronson: That leads to the next thing. If somebody like Craigslist doesn’t want their site scraped, are there things they can do technically to just make it so that, hey, this site is unscrapable? Or is that not really a possibility, given the it’s-just-text thing that we talked about?
Hartley: Yeah, I was going to get back to that. It’s an interesting question. There was a Quora question, I think, that was like, how do you prevent your site from being scraped? And I sort of listed a few things, but at a high level, if someone’s determined enough and you have information that’s available on your website, it will always be scrapable. Any information that you display on your website can be scraped, full stop. And people will do things where they say, okay, I’ve had too many requests from one IP address, it’s probably a robot, so I’m going to block it. Okay, you can do that. But now, if I’m determined enough, I can spin up hundreds of different machines that are all hitting your site at the same time. People try to use CAPTCHAs and things like that. That’s a pretty viable solution. But anything to do...
Bronson: With the CAPTCHA.
Hartley: Right, exactly. You say, oh, a human can read this and put in the information, whereas a computer would have a much more difficult time trying to access that. The problem with those things is that they also ruin usability for your average user. While CAPTCHAs are fantastic at stopping bots, they’re also fantastic for pissing off your users.
Bronson: Sometimes stopping users who can’t read them happens. Exactly.
Hartley: You look at studies where sign-up forms have a CAPTCHA, and you look at the conversion rates before and after adding the CAPTCHA to the form, and it’s like an order of magnitude fewer people will fill it out. I’m sure you’ve had this where you sit there and you try to fill out the CAPTCHA three or four times and it says incorrect, incorrect.
Bronson: You think, I’m a smart guy; if I can’t fill this out, this doesn’t work.
Hartley: Exactly. So there are definitely things you can do. Another thing is putting your information in a PDF or in an image or something like that, which makes it a lot more difficult to parse. One guy actually just sent me an email about an hour before this. He was trying to scrape some grocery information, and rather than posting it on a website, they were releasing these fliers as PDFs where it’s just a giant image with the products and the prices and stuff. It’s much, much more difficult to do that. So it definitely throws a wrench at any web scraper that might be trying to get at it. But there are optical character recognition systems. There are all sorts of ways of doing stuff like that. I actually know someone from back in the early days of Facebook. Now, if you go on Facebook and you see an email address, well, they actually very rarely post your email address unless you explicitly tell them you want an email address shown. For a while, what they would do is take your email address and generate a little image. So it would be an image of the text of your address. I had a friend who went to MIT back in the mid-2000s, and whenever they had parties at his frat, they would go scrape Facebook and invite tens of thousands of people via email. They were actually pulling down the email address images and doing some optical character recognition on them to figure out what the letters were. And they actually got an email from, it wasn’t Zuckerberg, it was one of the other early employees, I forget his name, but he was like, look, we know you guys are doing this. We’re working on something to stop it, but just don’t be dicks. Can you just stop doing it, please?
And it was like, you know, Facebook was actively trying to keep them out and they were finding new ways to get around it. So yeah, yeah.
Bronson: That’s a fun example. And it gets into being a good web scraping citizen, as you call it in your book. I mean, does that basically come down to: obey the wishes of the site, don’t send too many requests in a short period of time, don’t use the information you get for nefarious purposes? Is that basically the way to look at it, to be a good citizen?
Hartley: Yeah. So there’s some common sense stuff like that. In general, don’t be a dick about it. But there are actually some more formal standards that a lot of websites try to use to talk to scrapers. I mean, the most common scraper in the world is the Googlebot. It goes around and visits every single page on the Internet every few hours or days or something like that. So it’s all around. So there’s actually a standard called the Robots Exclusion Protocol, where basically you put up a file at your domain slash robots.txt, and if you Google it, you can read about what the spec looks like. You can basically tell a robot, okay, here are the directories that you’re allowed to scrape, but here are the ones I don’t want you to scrape. So if you’re building any kind of scraper or web crawler, you should check it, and a lot of the open source ones that you find do this automatically. Most of them will say, okay, before we visit any page of this domain, let’s look at the robots.txt file. Let’s make sure that we’re not violating the webmaster’s guidelines about what they do and don’t want scraped. So there’s that. Another best practice is to throttle your requests. You basically time, when you send a request, how long it takes to come back, and you delay your next request by at least ten times whatever that amount is. So if the site is super fast and comes back in half a second, okay, you wait 5 seconds and send the next request. If the site starts to slow down and it’s taking five or ten seconds to serve a request, you’re probably hammering it. You really need to back off. And also, a web server can only do so much at a time.
Waiting ten times what it takes you to get your response means that you’re letting ten other people in that time have unfettered access to the web server, and then you’re getting back in line and doing it again. So it’s a good rule of thumb in terms of not overwhelming a web server and not taking too many resources.
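Both practices, checking robots.txt and waiting ten times the response time, can be sketched with Python’s standard library. This is a minimal illustration, not a complete crawler: the `example-scraper` user agent, the `example.test` URLs, the sample robots.txt text, and the `fetch` callable are all assumptions, and the robots.txt rules are parsed from a string here rather than downloaded.

```python
import time
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(response_seconds, factor=10):
    """Seconds to wait before the next request: ten times the last response time."""
    return response_seconds * factor


def fetch_politely(url, fetch, rules, user_agent="example-scraper", sleep=time.sleep):
    """Fetch `url` only if robots.txt allows it, then pause in proportion to
    how long the server took. `fetch` is a caller-supplied (hypothetical)
    function that returns the page body."""
    if not rules.can_fetch(user_agent, url):
        return None  # the site asked robots to stay out of this path
    started = time.monotonic()
    body = fetch(url)
    elapsed = time.monotonic() - started
    sleep(polite_delay(elapsed))
    return body


# Sample rules: everything is allowed except the /private/ directory.
ROBOTS = """User-agent: *
Disallow: /private/
"""
rules = build_rules(ROBOTS)
```

With these rules, `rules.can_fetch("example-scraper", "http://example.test/private/data")` is false, and a half-second response leads to a five-second pause, matching the rule of thumb above. Against a live site, `RobotFileParser.set_url()` followed by `read()` would download the real robots.txt instead.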
Bronson: Yeah. You don’t want to cause, like, a denial of service from scraping.
Hartley: Yeah. And it’s really easy to do. I’ve written scripts that had bugs and didn’t do things properly, and then you end up sending dozens or hundreds of requests in a very, very short period of seconds. And you can actually knock sites offline, which is super counterproductive, because then you can’t get any information. You can’t do your scraping. So even just selfishly, you want to make sure that you’re slow enough that the web server has time to respond. But also, just as a good citizen on the net, you want to make sure that you’re letting other people have access to that site as well.
Bronson: Great. That’s a good way to look at it. Now, tell us again, what’s the Web site where we can get the book to actually go into more detail and get some actual code to do this?
Hartley: Yeah. So the book is actually on the blog; it should be in the links below. My site is just blog.hartleybrody.com. I think I have a link in the sidebar to buy the book. It’s only ten bucks. Actually, you know what? I’m going to add a coupon code. We’re going to do this right after. If you just put in “growth”, I will, I mean, after this, go set up a coupon code for 20% off. So it’ll be like eight bucks or something like that.
Bronson: That is a real marketer right there. Take that, use a promo code, incentivize people. I love it.
Hartley: It works, because now I can tell where the purchases are coming from. I can track it.
Bronson: I mean, it’s just the way the world works, you know. So that’s great.
Hartley: But yeah, I basically wrote it with the expectation that you know almost nothing about anything, and I don’t mean that to be condescending. I mean for it to be like, let’s go back to the basics. Let’s talk about how the Internet works. And then we start to build on those things and look at, here are some common pitfalls and things to watch out for, and here’s how to follow these best practices about being a good web scraping citizen and things like that.
Bronson: Yeah, that’s awesome. I would definitely recommend everybody pick it up and read it. Now, see this as kind of a final question. This has been awesome. People are going to love it, but you do so much related to growth. I mean, scraping is just one little piece of what you do, and probably not even the main piece. So when you look at startups right now, what’s the best advice that you have for any startup that is trying to grow? It may be scraping related, but it probably won’t be.
Hartley: Yeah, no, I would say it’s not. One thing I’ve learned pretty recently is to pay attention to new platforms that users are using. When Facebook first had a news feed and they opened up their platform API, you remember those apps where it was like, invite your friends, like Mafia Wars and stuff like that. Companies exploded on the back of that, because it was like, wow, look at all these ways we can reach new users. But once a platform has been out in the market for a while, users get more numb to it. The platform usually clamps down, so you can’t be that spammy. But there are tons and tons of platforms people use that aren’t just the traditional social media sites and search engines. I was talking to a friend who does growth at an online place where people can post online classes and things like that. And he was saying that they used to have one app for their whole company. You would go to the App Store, search for the name of their learning company, and get their app. And they realized, you know, we have courses on guitar lessons, we have courses on golf and programming and all these things. We should really make apps for each one of these individual classes. And so they had this cookie-cutter approach. They said, okay, each instructor, when they make a class, gets to set up their own app and we’ll put it in the iOS store for them. And they saw signups explode, because all of a sudden there were people going to the App Store and maybe searching for, you know, I’m on the golf course and I need to work on this, let me see if there are any good apps. There’s only a handful of golf apps. If you try to rank for golf on Google, there are hundreds or thousands of sites you’re competing against. But there are a lot of these new platforms.
Even with some of the big ones, like the iOS store, the Chrome App Store, the Android store, there are a lot of these stores where, even if your product doesn’t necessarily lend itself to that, there are ways that you can do user acquisition through them. I actively promote my book through my website, but I also use Leanpub, because they help with the formatting and converting the Markdown to a PDF. It’s a great service; I’d highly recommend it. But they also have a storefront, and I put exactly zero effort into my storefront on their site. I should probably do a better job, but I get random sales from people. I would never think to go there to buy a book online, but there are a lot of people that do, or they heard of some other book that was there, so they look around and they discover mine. There are all these other platforms where people are doing searches and looking for content. Don’t underestimate the value of some really niche thing. If you think users are on it and they’re searching for things that are related to your product, get on it. Be one of the first people on it and you’ll often see huge growth if you’re able to ride that. Often, once there’s already a case study written on a channel or tactic, it’s too late to get in on it. So you really want to find these things early.
Bronson: You want to be the case study.
Hartley: Exactly, because that’s going to be a thing. And I think the online learning company is going to be one of the ones that’s really at the forefront of doing this sort of cookie-cutter app stuff. They have to make sure that they toe the line with Apple’s guidelines, because Apple has some sort of don’t-be-spammy rules. So they’re sort of on the edge of that, figuring out, okay, we’re not trying to be spammy, but we’re trying to get a lot of users, and it could...
Bronson: Be good for Apple’s people as well. It could be good for the users to find the app they actually wanted when there weren’t a lot of them, and they weren’t just putting in random keywords for SEO-stuffing purposes. That’s what the app really is.
Hartley: Yeah, right. Exactly. And usually by the time the spammers are in there, it’s too late. So you want to be the first spammer, quote unquote, where hopefully you’re providing value and you’re not tricking users. But for any startup that you’re passionate about, there have got to be ways that you can find these new platforms that users are using to look for things and discover things, and just try to get on them.
Bronson: Yeah, that’s great. You know, the difference between being a spammer and being somebody who’s awesome is just value. If you provide value at scale, it’s awesome. If you provide no value at scale, it’s called spam. I mean, just provide value and things work themselves out. Well, this has been a great interview. We’ll end on your advice to be the first in these new channels and really take advantage of them. And thanks again for coming on Growth Hacker TV.
Hartley: Yeah, thanks for having me. I had a good time talking about the stuff that you guys are doing. Hope to see you guys moving up in the search results pretty soon. And I love to talk about this kind of stuff. I don’t do marketing full time anymore; I’m a full-stack web developer at a startup. But I love having these conversations, so I think there’ll probably be a Clarity link in the links below if you want to give me a call. If you’re in the Boston area, I’m happy to get lunch, a coffee, or something like that. I’d love to talk about this stuff and help.
Bronson: Yeah, that’s awesome, Hartley. Thanks again.