Robin is a site that aims to help people choose the parts to assemble their computer. Robin collects data from computer retail sites and returns the most affordable prices.
Used Tools 🛠
Selenium
MySQL
Python
Web-Scraping
We obtained the data for all the computer parts through Web-Scraping, a form of data mining that extracts data from websites and converts it into structured information for later analysis. The framework used to obtain this data was Selenium in Python.
Each site has a specific structure for its data:
Pichau
Pichau was certainly the site that gave us the most difficulty. The structure of the site changes depending on the computer it is opened on, so to keep the code working on every computer we ran it on, we used the socket lib to save a specific structure for each IP.
import socket
hostIP = socket.gethostname()  # local host name
IP = socket.gethostbyname(hostIP)  # specific IP of this machine
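As a rough illustration of "a specific structure for each IP", the sketch below keeps one layout entry per machine and falls back to a default when the IP is unknown. The table, its addresses, the `title_tag` key, and the helper names `local_ip` and `layout_for` are all hypothetical, not taken from the project.

```python
import socket

# Hypothetical per-machine layout table: which tag holds the product title
# on the page variant served to each IP. Addresses and tags are examples only.
LAYOUT_BY_IP = {
    "192.168.0.10": {"title_tag": "h2"},
    "192.168.0.11": {"title_tag": "h3"},
}
DEFAULT_LAYOUT = {"title_tag": "h2"}


def local_ip():
    """Return this machine's IP, or None if the host name cannot be resolved."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return None


def layout_for(ip, table=LAYOUT_BY_IP, default=DEFAULT_LAYOUT):
    """Look up the saved layout for an IP, falling back to the default."""
    return table.get(ip, default)


layout = layout_for(local_ip())
```

Note that on some machines `socket.gethostbyname(socket.gethostname())` resolves to a loopback address such as 127.0.0.1, so the lookup may land on the default entry.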
Another issue was the small fade-in effect applied from the third row of products onward: the item images only start to appear in the site's HTML source after you scroll down.
To solve this problem, we use a Selenium command to get the full height of the page and make it scroll down automatically according to that height.
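A minimal sketch of that scrolling step, assuming `driver` is an already opened Selenium WebDriver. The function name `scroll_to_bottom`, the `step` and `pause` values, and the use of `document.body.scrollHeight` to read the page height are assumptions for illustration.

```python
import time


def scroll_to_bottom(driver, step=600, pause=0.5):
    """Scroll down in fixed steps until the bottom of the page is reached.

    The height is re-read on every pass because lazily loaded content can
    make the page taller while scrolling.
    """
    scroll = 0
    height = driver.execute_script("return document.body.scrollHeight")
    while scroll < height:
        scroll += step
        driver.execute_script(f"window.scrollTo(0, {scroll});")
        time.sleep(pause)  # give the fade-in images time to load
        height = driver.execute_script("return document.body.scrollHeight")
    return height
```

Scrolling in steps rather than jumping straight to the bottom gives each row of products a chance to enter the viewport, which is what triggers the fade-in.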
Product images are in img tags and, because of the fade-in issue explained above, only load as the page scrolls. We run driver.execute_script(f"window.scrollTo(0, {scroll});") inside a while loop so that the code scrapes while scrolling down the page, and the images are collected separately from the rest of the product data.
On Pichau's site, the product titles are in h2 tags, so we pull all the h2 tags from the page using:
find_elements('tag name', 'h2')
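The lookups by tag name can be sketched as below, assuming `driver` is a Selenium WebDriver that has already scrolled through the page. The helper names, the filtering of empty entries, and reading the image URL from the `src` attribute are assumptions. On a real page, not every h2 or img necessarily belongs to a product, so further filtering may be needed.

```python
def collect_titles(driver):
    """Return the visible text of every non-empty <h2> element on the page."""
    elements = driver.find_elements("tag name", "h2")
    return [el.text.strip() for el in elements if el.text.strip()]


def collect_image_urls(driver):
    """Return the src of every <img> element that has one."""
    elements = driver.find_elements("tag name", "img")
    urls = [el.get_attribute("src") for el in elements]
    return [url for url in urls if url]
```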