Visual Basic Webpage Scraper

Introduction: In this tutorial I will show you how to create a webpage scraper in Visual Basic. A scraper downloads the source of a web page so that information can be gathered from websites through an automated process.

Steps of Creation:

Step 1: First, create a form with a Button (name it scrapeButton), a Text Box (name it linkURL), a Rich Text Box (name it srcBox) and a Web Browser (name it srcBrowser). The button starts the process of grabbing the source of the page whose address is in the linkURL Text Box. The source is put into srcBox, and srcBrowser then displays the srcBox text.

Step 2: Before we start scripting we need to import two namespaces, one for connecting to a website and another for reading the source:
  Imports System.Net
  Imports System.IO
Step 3: Our first script goes inside the button's Click event handler:
  Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
      'In here
  End Sub
Step 4: The first part of the script checks that a URL has been entered and puts it into a consistent format, adding "http://" and "www." where they are missing:
  If (Not linkURL.Text = Nothing) Then
      'Lower-casing keeps the StartsWith checks simple, but it also lower-cases the path
      linkURL.Text = linkURL.Text.ToLower()
      If (linkURL.Text.StartsWith("https://") OrElse linkURL.Text.StartsWith("http://")) Then
          'A scheme is present: insert "www." after it if it is missing
          If (Not linkURL.Text.StartsWith("https://www.") AndAlso Not linkURL.Text.StartsWith("http://www.")) Then
              If (linkURL.Text.StartsWith("http://")) Then
                  linkURL.Text = "http://www." & linkURL.Text.Substring(7)
              Else
                  linkURL.Text = "https://www." & linkURL.Text.Substring(8)
              End If
          End If
      ElseIf (linkURL.Text.StartsWith("www.")) Then
          'No scheme: default to http://
          linkURL.Text = "http://" & linkURL.Text
      Else
          'Neither a scheme nor "www.": add both
          linkURL.Text = "http://www." & linkURL.Text
      End If
  End If
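
The same rules can also be written as a function that works on a plain String, which makes them easy to try outside the form. The following is only a minimal sketch: the names UrlSketch, NormalizeUrl and IsHttpUrl are hypothetical and not part of the tutorial, the Trim call is an addition, and IsHttpUrl is one possible way to check the result with Uri.TryCreate.

```vbnet
Imports System

Module UrlSketch
    ' Hypothetical helper: applies the Step 4 rules to a plain String.
    ' Trim() is an addition that the tutorial code does not perform.
    Function NormalizeUrl(ByVal rawUrl As String) As String
        If String.IsNullOrWhiteSpace(rawUrl) Then Return String.Empty
        Dim url As String = rawUrl.Trim().ToLower()
        Dim scheme As String = "http://"   ' default when no scheme was typed
        If url.StartsWith("https://", StringComparison.Ordinal) Then
            scheme = "https://"
            url = url.Substring(8)
        ElseIf url.StartsWith("http://", StringComparison.Ordinal) Then
            url = url.Substring(7)
        End If
        ' Add "www." when it is missing, as the tutorial does.
        If Not url.StartsWith("www.", StringComparison.Ordinal) Then url = "www." & url
        Return scheme & url
    End Function

    ' Hypothetical check: True only for absolute http or https URLs.
    Function IsHttpUrl(ByVal url As String) As Boolean
        Dim parsed As Uri = Nothing
        If Not Uri.TryCreate(url, UriKind.Absolute, parsed) Then Return False
        Return parsed.Scheme = Uri.UriSchemeHttp OrElse parsed.Scheme = Uri.UriSchemeHttps
    End Function

    Sub Main()
        Console.WriteLine(NormalizeUrl("example.com"))          ' prints http://www.example.com
        Console.WriteLine(NormalizeUrl("www.example.com"))      ' prints http://www.example.com
        Console.WriteLine(NormalizeUrl("https://example.com"))  ' prints https://www.example.com
        Console.WriteLine(IsHttpUrl(NormalizeUrl("example.com")))  ' prints True
        Console.WriteLine(IsHttpUrl("ftp://example.com"))          ' prints False
    End Sub
End Module
```

In the click handler this could be used as linkURL.Text = NormalizeUrl(linkURL.Text), with the request made only when IsHttpUrl returns True.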
Step 5: The next part of the script is the main part of this tutorial and deals with getting the source of a web page. First we create a request for the entered URL (nothing is sent yet):
  Dim req As HttpWebRequest = DirectCast(WebRequest.Create(linkURL.Text), HttpWebRequest)
Step 6: Calling GetResponse sends the request and returns the server's reply, which we store in an HttpWebResponse variable:
  Dim res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
Step 7: Next, we turn the response into text by reading its stream with a StreamReader from our System.IO namespace import:
  Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()
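
Steps 5 to 7 can be collected into one function. The sketch below is an illustration rather than the tutorial's code: the names DownloadSketch and DownloadSource are hypothetical, and the Using blocks and the Try/Catch are additions that close the response and reader and turn a failed request into a Nothing result instead of an unhandled exception.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module DownloadSketch
    ' Hypothetical helper: returns the page source as text,
    ' or Nothing when the URL is malformed or the request fails.
    Function DownloadSource(ByVal url As String) As String
        Try
            ' Create the request object; nothing is sent yet.
            Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            ' GetResponse sends the request; Using disposes the response afterwards.
            Using res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(res.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        Catch ex As UriFormatException
            Return Nothing   ' the text is not a valid URL
        Catch ex As WebException
            Return Nothing   ' connection problem or HTTP error status
        End Try
    End Function

    Sub Main()
        ' http://www.example.com is only a placeholder address.
        Dim src As String = DownloadSource("http://www.example.com")
        If src Is Nothing Then
            Console.WriteLine("The page could not be downloaded.")
        Else
            Console.WriteLine("Downloaded " & src.Length & " characters.")
        End If
    End Sub
End Module
```

Inside the click handler, the three lines from Steps 5 to 7 could then be replaced by Dim src As String = DownloadSource(linkURL.Text), followed by a check for Nothing.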
Step 8: Finally, we set the text of the Rich Text Box (srcBox) to our web page source (src) and set the DocumentText of the Web Browser (srcBrowser) to the same source. This is just for testing purposes, to see whether the source code we receive in the response resembles the website we want to scrape:
  srcBox.Text = src
  srcBrowser.DocumentText = srcBox.Text
Test: After testing our program on http://www.google.com, we received the source code in our srcBox and our srcBrowser displayed the page correctly. Great!

Extracting Data:

Step 1: To make our data extraction easier, we are going to use a single function. This function uses regular expressions to extract a certain String from another, larger String. Add this to your source code:
  Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
      'Regex.Escape makes Str1 and Str2 match as literal text rather than as patterns
      Return Regex.Split(Regex.Split(Source, Regex.Escape(Str1))(Index + 1), Regex.Escape(Str2))(0)
  End Function
To make Regex work, add this Imports statement at the top of your source code file:
  Imports System.Text.RegularExpressions
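
GetBetween assumes that both delimiters occur in the source. If Str1 is not found, Regex.Split returns a single element and reading element Index + 1 throws an IndexOutOfRangeException. As a possible alternative, here is a minimal sketch that uses plain string searching and returns an empty string when a delimiter is missing. The names ExtractSketch and GetBetweenPlain are hypothetical, and this variant only finds the first occurrence.

```vbnet
Imports System

Module ExtractSketch
    ' Hypothetical variant of GetBetween: plain string search, no regular expressions.
    ' Returns an empty string when either delimiter is missing.
    Function GetBetweenPlain(ByVal source As String, ByVal str1 As String, ByVal str2 As String) As String
        Dim startPos As Integer = source.IndexOf(str1, StringComparison.Ordinal)
        If startPos < 0 Then Return String.Empty
        startPos += str1.Length   ' move past the opening delimiter
        Dim endPos As Integer = source.IndexOf(str2, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return String.Empty
        Return source.Substring(startPos, endPos - startPos)
    End Function

    Sub Main()
        Console.WriteLine(GetBetweenPlain("I'm Feeling Lucky", "I'm ", " Lucky"))   ' prints Feeling
        Console.WriteLine(GetBetweenPlain("price: $5 (approx.)", "$", " ("))        ' prints 5
        Console.WriteLine(GetBetweenPlain("no match here", "<b>", "</b>").Length)   ' prints 0
    End Sub
End Module
```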
Step 2: Now we can easily extract text from our source code. The following code will simply output the word "Feeling" by extracting the substring of our source code that starts straight after "I'm " and ends just before " Lucky", which is the text shown on the "I'm Feeling Lucky" search button:
  Dim extracted As String = GetBetween(src, "I'm ", " Lucky")
  MsgBox(extracted)
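
The optional Index parameter of GetBetween selects which occurrence of Str1 to start from, counting from 0. The sketch below tries this on a small made-up HTML fragment instead of a downloaded page. The module name IndexSketch and the sample string are assumptions for illustration, and the copy of GetBetween shown here passes the delimiters through Regex.Escape so that they are matched as literal text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IndexSketch
    ' Copy of the tutorial's GetBetween, with the delimiters escaped
    ' so they are treated as literal text rather than as patterns.
    Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
        Return Regex.Split(Regex.Split(Source, Regex.Escape(Str1))(Index + 1), Regex.Escape(Str2))(0)
    End Function

    Sub Main()
        ' Made-up fragment standing in for a downloaded page source.
        Dim html As String = "<ul><li>one</li><li>two</li><li>three</li></ul>"
        Console.WriteLine(GetBetween(html, "<li>", "</li>"))      ' prints one
        Console.WriteLine(GetBetween(html, "<li>", "</li>", 1))   ' prints two
        Console.WriteLine(GetBetween(html, "<li>", "</li>", 2))   ' prints three
    End Sub
End Module
```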
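Project Completed! That's it! Here is the finished source code for the scraper form (the GetBetween function and its Regex import from the Extracting Data section are not included in this listing):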
  Imports System.Net
  Imports System.IO

  Public Class Form1

      Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
          If (Not linkURL.Text = Nothing) Then
              'Put the URL into a consistent format: scheme plus "www."
              linkURL.Text = linkURL.Text.ToLower()
              If (linkURL.Text.StartsWith("https://") OrElse linkURL.Text.StartsWith("http://")) Then
                  If (Not linkURL.Text.StartsWith("https://www.") AndAlso Not linkURL.Text.StartsWith("http://www.")) Then
                      If (linkURL.Text.StartsWith("http://")) Then
                          linkURL.Text = "http://www." & linkURL.Text.Substring(7)
                      Else
                          linkURL.Text = "https://www." & linkURL.Text.Substring(8)
                      End If
                  End If
              ElseIf (linkURL.Text.StartsWith("www.")) Then
                  linkURL.Text = "http://" & linkURL.Text
              Else
                  linkURL.Text = "http://www." & linkURL.Text
              End If
              'Request the page and read the response into a String
              Dim req As HttpWebRequest = DirectCast(WebRequest.Create(linkURL.Text), HttpWebRequest)
              Dim res As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
              Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()
              'Show the raw source and a rendered preview of it
              srcBox.Text = src
              srcBrowser.DocumentText = srcBox.Text
          End If
      End Sub

  End Class

Comments

Submitted by Vladidec (not verified) on Wed, 12/07/2016 - 01:18

This does not work when I try an https website like Wikipedia. Do you have any suggestions?
