Introduction to Web Scraping
Web scraping, or crawling, is the act of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want. It can be done manually, but the term generally refers to the automated process of downloading a page's HTML, parsing it to extract the data, and saving that data into a database for further analysis or use.
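To make that pipeline concrete, here is a minimal sketch in Java using the jsoup library (assumed to be on the classpath); the URL and the `.product-price` selector are hypothetical placeholders, not part of any real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScraperSketch {
    public static void main(String[] args) throws Exception {
        // 1. Download the HTML of the page (hypothetical URL)
        Document doc = Jsoup.connect("https://example.com/product").get();

        // 2. Parse/extract the data we want (hypothetical CSS selector)
        Element price = doc.selectFirst(".product-price");
        if (price != null) {
            // 3. In a real scraper, this is where you would save the value to a database
            System.out.println("Price: " + price.text());
        }
    }
}
```

The three steps mirror the definition above: fetch, extract, store. We will look at each of them in more detail later.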
Web fundamentals
The internet is really complex: many underlying technologies and concepts are involved in displaying a simple web page in your browser. I don't pretend to explain everything, but I will show you the most important things you need to understand in order to extract data from the web.
HyperText Transfer Protocol
From Wikipedia:
The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text. HTTP is the protocol to exchange or transfer hypertext.
So basically, as in many network protocols, HTTP uses a client/server model: an HTTP client (a browser, your Java program, curl, wget…) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache…). The server then answers with a response (the HTML code, for example) and closes the connection. HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful.
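As a quick illustration of this request/response exchange, here is a minimal HTTP GET in Java using the standard java.net.http.HttpClient (available since Java 11); the URL is just a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleHttpGet {
    public static void main(String[] args) throws Exception {
        // The client plays the role of the browser or curl
        HttpClient client = HttpClient.newHttpClient();

        // "I want to see that page: /product" (placeholder URL)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/product"))
                .GET()
                .build();

        // The server answers with a status code and a body (the HTML)
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Each run of this program is an independent transaction: the client sends one request, reads one response, and the connection carries no memory of previous exchanges, which is exactly what "stateless" means.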