Visual Basic Webpage Scraper

Submitted by Yorkiebar on Saturday, August 31, 2013 - 05:42.

Introduction: In this tutorial I will show you how to create a webpage scraper in Visual Basic. It can be used to gather information from websites through an automated process.

Steps of Creation:

Step 1: First, create a form with a button (named scrapeButton), a Text Box (named linkURL), a Rich Text Box (named srcBox) and a Web Browser (named srcBrowser). Clicking the button grabs the source of the page entered in the linkURL Text Box; the source is put into srcBox, and srcBrowser then displays the srcBox text.

Step 2: Before we start scripting we need to import two namespaces: one for connecting to a website and another for reading the source:

        Imports System.Net
        Imports System.IO

Step 3: Our first script goes inside the button's click event handler:

    Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
                'In Here
    End Sub

Step 4: The first part of the script tidies up the entered URL so that it is in the format we expect: it lowercases the text and adds "http://" and "www." where they are missing:

    If (Not linkURL.Text = Nothing) Then
        linkURL.Text = linkURL.Text.ToLower()
        If (linkURL.Text.StartsWith("https://") Or linkURL.Text.StartsWith("http://")) Then
            If (Not linkURL.Text.StartsWith("https://www.") And Not linkURL.Text.StartsWith("http://www.")) Then
                If Not (linkURL.Text.StartsWith("www.")) Then
                    If (linkURL.Text.StartsWith("http://")) Then
                        linkURL.Text = "http://www." & linkURL.Text.Substring(7, linkURL.Text.Length - 7)
                    Else
                        linkURL.Text = "https://www." & linkURL.Text.Substring(8, linkURL.Text.Length - 8)
                    End If
                End If
            End If
        ElseIf (linkURL.Text.StartsWith("www.")) Then
            linkURL.Text = "http://" & linkURL.Text
        Else
            linkURL.Text = "http://www." & linkURL.Text
        End If
    End If
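
A possible alternative for Step 4: the checks above always add a "www." prefix and lowercase the whole address, which can break hosts that have no "www." name (for example a subdomain) or paths that are case-sensitive. The sketch below, which is not part of the original project, lets the Uri class do the parsing instead. NormalizeUrl is a hypothetical helper name; it assumes the default System import of a Windows Forms project and only adds "http://" when no scheme was typed:

    ' Hypothetical helper (not from the original tutorial); place it inside Form1.
    ' Returns a cleaned-up absolute URL, or Nothing when the text is not a usable web address.
    Private Function NormalizeUrl(ByVal input As String) As String
        Dim candidate As String = input.Trim()
        If candidate.Length = 0 Then Return Nothing

        ' Add a scheme only when none was typed, e.g. "example.com" becomes "http://example.com".
        If Not candidate.Contains("://") Then
            candidate = "http://" & candidate
        End If

        ' Let Uri validate the address instead of checking prefixes by hand.
        Dim parsed As Uri = Nothing
        If Uri.TryCreate(candidate, UriKind.Absolute, parsed) AndAlso
           (parsed.Scheme = Uri.UriSchemeHttp OrElse parsed.Scheme = Uri.UriSchemeHttps) Then
            Return parsed.AbsoluteUri
        End If

        Return Nothing
    End Function

It could be called at the top of the click handler, for example linkURL.Text = NormalizeUrl(linkURL.Text), skipping the request when it returns Nothing.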

Step 5: The next part of the script is the main part of this tutorial and deals with getting the source of a web page. First we create the request for the entered URL:

    Dim req As HttpWebRequest = HttpWebRequest.Create(linkURL.Text)

Step 6: Calling GetResponse sends the request and returns the response, which we store in an HttpWebResponse variable:

    Dim res As HttpWebResponse = req.GetResponse()

Step 7: Next, we can read the response and essentially turn it into text. We do this by using a StreamReader from our System.IO namespace import:

        Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()
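
Steps 5 to 7 can also be grouped into one helper that closes the response and the reader when it is done and reports a failed request instead of crashing. This is only a sketch: FetchSource is a hypothetical name, it relies on the System.Net and System.IO imports from Step 2, the Tls12 line assumes .NET Framework 4.5 or later, and enabling TLS 1.2 is a guess at one common reason HTTPS requests fail on older setups, not a confirmed fix:

    ' Hypothetical helper (not from the original tutorial); place it inside Form1.
    ' Returns the page source as text, or Nothing when the request fails.
    Private Function FetchSource(ByVal url As String) As String
        ' Assumption: .NET Framework 4.5+, where SecurityProtocolType.Tls12 is defined.
        ServicePointManager.SecurityProtocol = ServicePointManager.SecurityProtocol Or SecurityProtocolType.Tls12

        Try
            Dim req As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)

            ' Using blocks dispose the response and the reader even if reading fails.
            Using res As HttpWebResponse = CType(req.GetResponse(), HttpWebResponse)
                Using reader As New StreamReader(res.GetResponseStream())
                    Return reader.ReadToEnd()
                End Using
            End Using
        Catch ex As UriFormatException
            ' The text could not be parsed as a URL at all.
            MsgBox("Invalid URL: " & ex.Message)
        Catch ex As WebException
            ' Covers DNS failures, timeouts, TLS errors and HTTP error status codes.
            MsgBox("Request failed: " & ex.Message)
        End Try

        Return Nothing
    End Function

With this in place the click handler could use Dim src As String = FetchSource(linkURL.Text) and continue only when src is not Nothing.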

Step 8: Finally, we set the Rich Text Box (srcBox) text to our web page source (src) and set the Web Browser (srcBrowser)'s DocumentText to the same source. This is just for testing purposes, to check that the source code we received from the response resembles the website we want to scrape:

    srcBox.Text = src
    srcBrowser.DocumentText = srcBox.Text

Test: As you can see from the image below, after testing our program on http://www.google.com we received the source code in srcBox and srcBrowser displays the page correctly. Great!

Extracting Data:

Step 1: To make our data extraction easier, we are going to use a single function. This function uses regular expressions to extract a certain String from another, larger String. Add this to your source code:

    Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
        Return Regex.Split(Regex.Split(Source, Str1)(Index + 1), Str2)(0)
    End Function

To make Regex work, add this Imports statement at the top of your source code file:

        Imports System.Text.RegularExpressions
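
One thing to keep in mind: Regex.Split treats Str1 and Str2 as regular expression patterns, so delimiters containing characters such as "." or "(" may not match literally, and GetBetween throws an exception when Str1 is not found in Source. A minimal sketch of a more forgiving variant follows; GetBetweenSafe is a hypothetical name and it needs the same Imports System.Text.RegularExpressions line:

    ' Hypothetical variant of GetBetween (not from the original tutorial).
    ' Returns the text between startText and endText, or an empty String when there is no match.
    Private Function GetBetweenSafe(ByVal source As String, ByVal startText As String, ByVal endText As String) As String
        ' Regex.Escape makes both delimiters match literally;
        ' (.*?) lazily captures the shortest text between them.
        Dim pattern As String = Regex.Escape(startText) & "(.*?)" & Regex.Escape(endText)

        ' Singleline lets "." match line breaks, so the capture may span several lines.
        Dim m As Match = Regex.Match(source, pattern, RegexOptions.Singleline)
        If m.Success Then Return m.Groups(1).Value

        Return String.Empty
    End Function

For example, GetBetweenSafe("I'm Feeling Lucky", "I'm ", " Lucky") returns "Feeling".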

Step 2: Now we can easily extract text from our source code. The following code will simply output the word "Feeling" by extracting a substring from our source code which starts straight after "I'm " and ends just before " Lucky", which is the text shown on the "I'm Feeling Lucky" search button. Put it in the click handler after src has been set:

    Dim extracted As String = GetBetween(src, "I'm ", " Lucky")
    MsgBox(extracted)
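
GetBetween returns one match at a time (chosen with its Index argument). To collect every occurrence in one pass, something like the following could be used. GetAllBetween is a hypothetical helper, not part of the original project; it assumes the System.Text.RegularExpressions import plus the System.Collections.Generic import that Visual Basic projects include by default:

    ' Hypothetical helper (not from the original tutorial); place it inside Form1.
    ' Returns every substring found between startText and endText, in document order.
    Private Function GetAllBetween(ByVal source As String, ByVal startText As String, ByVal endText As String) As List(Of String)
        Dim results As New List(Of String)

        ' Escape the delimiters so they match literally; (.*?) captures the text between them.
        Dim pattern As String = Regex.Escape(startText) & "(.*?)" & Regex.Escape(endText)

        For Each m As Match In Regex.Matches(source, pattern, RegexOptions.Singleline)
            results.Add(m.Groups(1).Value)
        Next

        Return results
    End Function

As a usage example, GetAllBetween(src, "href=""", """") would gather the targets of links whose href attribute is written with double quotes.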

Project Completed! That's it! Here is the finished source code:

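Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

    Private Sub scrapeButton_Click(sender As Object, e As EventArgs) Handles scrapeButton.Click
        If (Not linkURL.Text = Nothing) Then
            'Tidy up the entered URL: lowercase it and add "http://" and "www." where missing
            linkURL.Text = linkURL.Text.ToLower()
            If (linkURL.Text.StartsWith("https://") Or linkURL.Text.StartsWith("http://")) Then
                If (Not linkURL.Text.StartsWith("https://www.") And Not linkURL.Text.StartsWith("http://www.")) Then
                    If Not (linkURL.Text.StartsWith("www.")) Then
                        If (linkURL.Text.StartsWith("http://")) Then
                            linkURL.Text = "http://www." & linkURL.Text.Substring(7, linkURL.Text.Length - 7)
                        Else
                            linkURL.Text = "https://www." & linkURL.Text.Substring(8, linkURL.Text.Length - 8)
                        End If
                    End If
                End If
            ElseIf (linkURL.Text.StartsWith("www.")) Then
                linkURL.Text = "http://" & linkURL.Text
            Else
                linkURL.Text = "http://www." & linkURL.Text
            End If

            'Request the page and read the whole response as text
            Dim req As HttpWebRequest = HttpWebRequest.Create(linkURL.Text)
            Dim res As HttpWebResponse = req.GetResponse()
            Dim src As String = New StreamReader(res.GetResponseStream()).ReadToEnd()

            'Show the raw source and render it for comparison
            srcBox.Text = src
            srcBrowser.DocumentText = srcBox.Text

            'Extract the text between "I'm " and " Lucky"
            Dim extracted As String = GetBetween(src, "I'm ", " Lucky")
            MsgBox(extracted)
        End If
    End Sub

    'Returns the text found between Str1 and Str2 in Source
    Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
        Return Regex.Split(Regex.Split(Source, Str1)(Index + 1), Str2)(0)
    End Function
End Class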

Comments

https - does not work

This does not work when I try an HTTPS website like Wikipedia. Do you have any suggestions?
