Description ⚙

Robin is a site that helps people choose the parts to assemble their computer. It collects data from computer-part sales sites and returns the most affordable prices.

Used Tools 🛠

Selenium
MySQL
Python

Web-Scraping

We obtain the data for the computer parts through web scraping, a form of data mining that extracts data from websites and converts it into structured information for later analysis. The framework used to obtain this data is Selenium, in Python.

Each site has a specific structure for its data:

Pichau

Pichau was certainly the site we had the most difficulty with. The structure of the site changes depending on the computer it is opened on, so to keep the code working on every computer where we ran it, we used the `socket` library to save a specific structure for each IP.

    import socket

    hostIP = socket.gethostname()           # local host name (not yet an IP)
    IP = socket.gethostbyname(hostIP)       # IP address the host name resolves to
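The README does not show how the structure is stored per IP, so the following is only a minimal sketch of one way it could be done. `STRUCTURE_BY_IP`, `DEFAULT_STRUCTURE`, `local_ip` and `structure_for` are hypothetical names, and the IP addresses and tag names in the dictionary are placeholders, not values from the project.

```python
import socket

# Hypothetical per-machine settings; addresses and tag names are placeholders.
STRUCTURE_BY_IP = {
    "192.168.0.10": {"name_tag": "h2", "image_tag": "img"},
    "192.168.0.11": {"name_tag": "h3", "image_tag": "img"},
}

# Used when the current machine has no saved structure yet.
DEFAULT_STRUCTURE = {"name_tag": "h2", "image_tag": "img"}


def local_ip() -> str:
    """Resolve this machine's host name to an IPv4 address."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Name resolution can fail on some networks; fall back to loopback.
        return "127.0.0.1"


def structure_for(ip: str) -> dict:
    """Return the saved structure for an IP, or the default one."""
    return STRUCTURE_BY_IP.get(ip, DEFAULT_STRUCTURE)
```

A scraper could then call `structure_for(local_ip())` once at start-up and read the tag names from the returned dictionary.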

Another issue was the small fade-in effect that the site applies from the third product row onward: the item images only appear in the site's HTML source after you scroll down.

To solve this problem, we use a Selenium command to get the full height of the page and scroll down automatically according to that height.

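    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome; load the page with driver.get(...) first
    height = driver.execute_script("return document.body.scrollHeight")  # full page height in pixels
    scroll = 0
    while scroll < height:
        driver.execute_script(f"window.scrollTo(0, {scroll});")  # move down 200 pixels at a time
        scroll += 200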
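The scrolling step can also be wrapped in a reusable function. This is a sketch, not the project's code: `scroll_to_bottom`, `step` and `max_steps` are hypothetical names, and re-reading the page height on every step is an added assumption that lazily loaded rows may make the page taller while scrolling. The function only relies on the driver having an `execute_script` method, as Selenium's WebDriver does.

```python
def scroll_to_bottom(driver, step=200, max_steps=500):
    """Scroll down in fixed steps and return how many steps were taken."""
    scroll = 0
    steps = 0
    height = driver.execute_script("return document.body.scrollHeight")
    # max_steps is a safety limit in case the page keeps growing.
    while scroll < height and steps < max_steps:
        driver.execute_script(f"window.scrollTo(0, {scroll});")
        scroll += step
        steps += 1
        # Re-read the height: newly loaded content can make the page taller.
        height = driver.execute_script("return document.body.scrollHeight")
    return steps
```

With a real browser session, this would be called as `scroll_to_bottom(driver)` after the page has loaded and before the images are collected.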

With those problems solved, it is time to get the specifications of each part, such as price and name.

With that in mind, we chose this list of specifications for Pichau:

| Specification | Data |
| --- | --- |
| Installment price | R$ 771,29 |
| Price | R$ 678,74 |
| Name | MEMORIA TEAM GROUP T-FORCE DELTA RGB, 8GB(1X8GB), DDR4, 3200MHZ, C16, BRANCO, TF4D48G3200HC16C01 |
| Link | https://www.pichau.com.br/memoria-team-group-t-force-delta-rgb-8gb-1x8gb-ddr4-3200mhz-c16-branco-tf4d48g3200hc16c01 |
| Image link | https://media.pichau.com.br/media/catalog/product/cache/2f958555330323e505eba7ce930bdf27/t/f/tf4d48g3200hc16c011.jpg |
| Scraping time | 09/07/2022 23:22:34 |
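To show how one scraped row could become structured data, here is a minimal sketch. The `Product` class, its field names, and the helpers `parse_brl` and `parse_scraped_at` are hypothetical and not taken from the project. It assumes prices use the Brazilian format (`.` for thousands, `,` for decimals) and that the scraping time is written as day/month/year.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


def parse_brl(text: str) -> Decimal:
    """Convert a price such as 'R$ 1.234,56' to Decimal('1234.56')."""
    digits = text.replace("R$", "").strip()
    # Drop thousands separators, then switch the decimal comma to a point.
    digits = digits.replace(".", "").replace(",", ".")
    return Decimal(digits)


def parse_scraped_at(text: str) -> datetime:
    """Parse a timestamp such as '09/07/2022 23:22:34' (assumed day/month/year)."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


@dataclass
class Product:
    name: str
    price: Decimal
    installment_price: Decimal
    link: str
    image_link: str
    scraped_at: datetime
```

Storing prices as `Decimal` rather than `float` avoids rounding surprises when comparing offers from different stores.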

How to perform the scraping on Pichau:

Information block: product name

Web-scraping code:

    # Crawling Products == Name
    product = driver.find_elements('tag name', 'h2')
    for i in product:
        if i.text == "":
            continue  # skip h2 tags with no text
        namesProducts.append(i.text)

Explanation: On Pichau's site, product titles are in `h2` tags, so we pull all the `h2` tags from the page using `find_elements('tag name', 'h2')` and skip the empty ones.

Information block: product image

Web-scraping code:

    # Crawling Products == Image
    while scroll < height:
        driver.execute_script(f"window.scrollTo(0, {scroll});")
        product = driver.find_elements('tag name', 'img')
        for e in product:
            src = e.get_attribute('src')  # may be None when the tag has no src
            if src and 'product' in src:
                imgProducts.append(src)
        scroll += 200
    imgProducts = list(dict.fromkeys(imgProducts))  # drop duplicates, keep order

Explanation: Product images are in `img` tags. Because of the fade-in issue explained above, we run `driver.execute_script(f"window.scrollTo(0, {scroll});")` inside a `while` loop, so the code scrapes while it scrolls down the page. We then separate the product images from the other images by keeping only those whose `src` contains `'product'`.
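The filtering logic described above can be separated from the browser so it is easy to check on its own. This is a sketch under assumptions: `collect_names` and `collect_product_images` are hypothetical helper names, and they take plain strings (the `text` and `src` values already read from the elements) instead of Selenium elements.

```python
def collect_names(texts):
    """Keep only non-empty heading texts, in page order."""
    return [t for t in texts if t]


def collect_product_images(srcs, marker="product"):
    """Keep image URLs containing the marker and drop repeated URLs."""
    kept = [s for s in srcs if s and marker in s]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(kept))
```

Because the page is scanned again after every scroll step, the same image URL is usually collected several times, which is why the de-duplication step matters.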

UNDER DEVELOPMENT

📜 Project developed in the Entra21 morning Python class
