Introduction to Web Scraping
Web scraping, or crawling, is the act of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want. It can be done manually, but the term generally refers to the automated process of downloading a page's HTML, parsing it to extract the data, and saving that data into a database for further analysis or use.
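To make that pipeline concrete, here is a minimal sketch in Java using the jsoup library (assumed to be on the classpath); the URL and the `.product-price` selector are hypothetical placeholders, not part of any real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScraperSketch {
    public static void main(String[] args) throws Exception {
        // 1. Download the HTML of the page (hypothetical URL)
        Document doc = Jsoup.connect("https://example.com/product").get();

        // 2. Parse/extract the data we want (hypothetical CSS selector)
        Element price = doc.selectFirst(".product-price");
        if (price != null) {
            // 3. In a real scraper, this is where you would save the value to a database
            System.out.println("Price: " + price.text());
        }
    }
}
```

The three steps mirror the definition above: fetch, extract, store. We will look at each of them in more detail later.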
Web fundamentals
The internet is really complex: many underlying technologies and concepts are involved in displaying a simple web page in your browser. I don't pretend to explain everything, but I will show you the most important things you need to understand in order to extract data from the web.
HyperText Transfer Protocol
From Wikipedia:
The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text. HTTP is the protocol to exchange or transfer hypertext.
So basically, as in many network protocols, HTTP uses a client/server model: an HTTP client (a browser, your Java program, curl, wget…) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache…). The server then answers with a response (the HTML code, for example) and closes the connection. HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful.
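As a quick illustration of this request/response exchange, here is a minimal HTTP GET in Java using the standard java.net.http.HttpClient (available since Java 11); the URL is just a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleHttpGet {
    public static void main(String[] args) throws Exception {
        // The client plays the role of the browser or curl
        HttpClient client = HttpClient.newHttpClient();

        // "I want to see that page: /product" (placeholder URL)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/product"))
                .GET()
                .build();

        // The server answers with a status code and a body (the HTML)
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Each run of this program is an independent transaction: the client sends one request, reads one response, and the connection carries no memory of previous exchanges, which is exactly what "stateless" means.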