Not every website offers an API or another mechanism for accessing its data programmatically. When it does not, web scraping is often the only way to extract the information.
Several tools can scrape information from a website, and one of them is selenium-webdriver. The rest of this post deals exclusively with Selenium.
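Before doing anything else, make sure the selenium-webdriver gem is installed, either directly with gem install or, if your project has a Gemfile that lists the gem, with bundle install.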
gem install selenium-webdriver
bundle install
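To confirm that the installation worked before writing any scraping code, a small check like the following can help. It is only a sketch: the helper name gem_installed? is made up for this post, and it asks RubyGems whether the gem is present without loading it.

```ruby
# Returns true when at least one version of the named gem is installed.
# Uses RubyGems' own index, so the gem itself is not loaded.
# (gem_installed? is a hypothetical helper name, not part of RubyGems.)
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

if gem_installed?("selenium-webdriver")
  puts "selenium-webdriver is installed"
else
  puts "selenium-webdriver is missing; run: gem install selenium-webdriver"
end
```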
Let’s Get to Scraping Now...
You should be familiar with at least the basic HTML tags in order to scrape information from a website. Once you know the basics, you are good to go.
The first step is to start a WebDriver session. Selenium has traditionally defaulted to the Mozilla Firefox browser; if you want to run the WebDriver in Chrome instead, it takes two steps:
Download the version of the ChromeDriver server that matches your Chrome installation. Then copy the chromedriver executable into a bin directory that is on your PATH, so that Selenium can find it. (Selenium 4.6 and later ship with Selenium Manager, which can fetch a matching driver automatically.)
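If Chrome refuses to start, the first thing worth checking is whether the chromedriver executable is actually reachable through PATH. Here is a minimal sketch of such a check in plain Ruby. The helper name find_on_path is an assumption made for this post, and the code assumes a Unix-like system (on Windows the file would be chromedriver.exe).

```ruby
# Returns the full path of the first executable called `name` found in the
# given PATH-style string, or nil when there is none.
# (find_on_path is a hypothetical helper, not part of Selenium.)
def find_on_path(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

location = find_on_path("chromedriver")
puts(location ? "chromedriver found at #{location}" : "chromedriver is not on PATH")
```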
Scraping with Selenium is mainly about loading a page and locating the UI elements that hold the content you want.
# scraping.rb
require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
Navigate the driver to the URL of the page you want to scrape. I will be scraping the Wikipedia page for Yukihiro Matsumoto and will concentrate on fetching the person's name, birthplace and image URL.
# scraping.rb
require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
Now that the page is loaded, we can start locating elements. Before that, define an explicit wait with a 20-second timeout, so that a lookup is retried for up to 20 seconds before a timeout error (Selenium::WebDriver::Error::TimeoutError in the Ruby bindings) is raised.
To locate elements on the page, use the find_element method, which returns a single WebElement, or the find_elements method, which returns a list of WebElements.
# scraping.rb
require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
wait = Selenium::WebDriver::Wait.new(:timeout => 20)
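To get a feel for what an explicit wait does, here is a simplified sketch of the idea in plain Ruby: call a block repeatedly until it returns something truthy, and give up once the deadline has passed. This is not Selenium's actual implementation; poll_until and PollTimeoutError are names invented for this illustration.

```ruby
# Raised by this sketch when the deadline passes (hypothetical name).
class PollTimeoutError < StandardError; end

# Calls the block repeatedly until it returns a truthy value, then returns
# that value. Raises PollTimeoutError once `timeout` seconds have elapsed.
def poll_until(timeout:, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise PollTimeoutError, "timed out after #{timeout} seconds"
    end
    sleep interval
  end
end

# Example: the block only succeeds on its third call.
attempts = 0
value = poll_until(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? "ready" : nil
end
puts "#{value} after #{attempts} attempts"
```

With the real Wait object, the block passed to wait.until plays the same role as the block passed to poll_until here.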
To get the name, locate the page heading by its class name, firstHeading, inside the wait block, then call the text method on the returned element to display the name Yukihiro Matsumoto.
# scraping.rb
require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
wait = Selenium::WebDriver::Wait.new(:timeout => 20)
name = wait.until {
  driver.find_element(:class, "firstHeading")
}
puts name.text
# Yukihiro Matsumoto
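One practical difference between the two lookup methods: find_element raises Selenium::WebDriver::Error::NoSuchElementError when nothing matches, while find_elements simply returns an empty list. For optional content, that makes find_elements handy. The helper below is a small sketch of this pattern; text_or_nil is a name made up for this post, and it works with any driver or element that responds to find_elements.

```ruby
# Returns the text of the first matching element, or nil when nothing matches.
# `driver` is expected to be a Selenium driver (or element) that responds to
# find_elements. (text_or_nil is a hypothetical helper, not a Selenium method.)
def text_or_nil(driver, how, what)
  elements = driver.find_elements(how, what)
  elements.empty? ? nil : elements.first.text
end

# Usage with the driver created earlier:
#   puts text_or_nil(driver, :class, "firstHeading")
```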
To get the birthplace, first locate the infobox, then look inside it for the element with the class name birthplace, and call the text method on the result.
# scraping.rb
...
born = wait.until {
  infobox = driver.find_element(:css, ".infobox.biography.vcard")
  infobox.find_element(:class, "birthplace")
}
puts born.text
# Osaka Prefecture, Japan
The final step is to get the image URL: look for the element with the class name image and call the attribute method with href as the argument, which returns the URL that the image links to.
# scraping.rb
...
image_url = wait.until {
  driver.find_element(:class, "image").attribute("href")
}
puts image_url
# https://en.wikipedia.org/wiki/File:Yukihiro_Matsumoto.JPG
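With all three lookups in place, it can be convenient to gather them in one method that returns a hash. The following is only a sketch of one way to organise the code: scrape_person is a name invented here, it expects the driver and wait objects created earlier to be passed in, and the selectors are the ones used in this post.

```ruby
# Collects the three values scraped in this walkthrough into one hash.
# `driver` and `wait` are expected to be the Selenium driver and Wait object
# created earlier. (scrape_person is a hypothetical helper name.)
def scrape_person(driver, wait)
  name = wait.until { driver.find_element(:class, "firstHeading") }
  born = wait.until {
    infobox = driver.find_element(:css, ".infobox.biography.vcard")
    infobox.find_element(:class, "birthplace")
  }
  image_url = wait.until { driver.find_element(:class, "image").attribute("href") }

  { name: name.text, birth_place: born.text, image_url: image_url }
end

# Usage with the objects from scraping.rb:
#   person = scrape_person(driver, wait)
#   puts person[:name]
```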
After everything is scraped, quit the driver to close the browser.
# scraping.rb
...
puts image_url
# https://en.wikipedia.org/wiki/File:Yukihiro_Matsumoto.JPG
driver.quit
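If a lookup raises before the script reaches driver.quit, the browser window stays open. One way to guard against that is to wrap the work in a method with an ensure clause. This is a sketch of the pattern rather than anything Selenium provides; with_driver is a name made up for this post.

```ruby
# Yields the driver to the block and always calls quit afterwards,
# even when the block raises. (with_driver is a hypothetical helper.)
def with_driver(driver)
  yield driver
ensure
  driver.quit
end

# Usage:
#   with_driver(Selenium::WebDriver.for(:chrome)) do |driver|
#     driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
#     # ... scraping code ...
#   end
```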
Enjoy Scraping!!!
Ameena