Visual Basic Twitter Feed Scraper


Introduction:
Welcome to my tutorial on how to create a Twitter profile tweet scraper in Visual Basic. First, create a form that contains a textbox for the profile username and a button to begin the process. The code below uses the default control names, TextBox1 and Button1.

Steps of Creation:
Step 1:
Add the following two imports at the top of the form's code file so we can download the profile page source and parse it:

  Imports System.Net
  Imports System.Text.RegularExpressions

Step 2:
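Now add two functions: GetBetween and GetBetweenAll. We will use these Regex-based functions to extract the tweets from the web page source.

  ' Returns the text between the (Index + 1)-th occurrence of Str1 and the Str2 that follows it.
  Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
      ' Regex.Escape makes the delimiters match literally instead of as regex patterns.
      Dim parts As String() = Regex.Split(Source, Regex.Escape(Str1))
      ' Return an empty string instead of throwing when Str1 is not found.
      If Index + 1 >= parts.Length Then Return ""
      Return Regex.Split(parts(Index + 1), Regex.Escape(Str2))(0)
  End Function

  ' Returns every piece of text that starts after Str1 and ends before the next Str2.
  Private Function GetBetweenAll(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String) As String()
      Dim Results As New List(Of String)
      Dim T As New List(Of String)
      T.AddRange(Regex.Split(Source, Regex.Escape(Str1)))
      T.RemoveAt(0) ' Drop everything before the first Str1.
      For Each I As String In T
          Results.Add(Regex.Split(I, Regex.Escape(Str2))(0))
      Next
      Return Results.ToArray()
  End Function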
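
To see what the two helpers return, here is a minimal console sketch. It assumes a separate Console Application project; the module name GetBetweenDemo and the sample strings are made up for illustration, and the helpers are copied in (with Regex.Escape applied to the delimiters) only so the module runs on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GetBetweenDemo
    ' Copy of the tutorial helper, with the delimiters escaped so they match literally.
    Public Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
        Dim parts As String() = Regex.Split(Source, Regex.Escape(Str1))
        If Index + 1 >= parts.Length Then Return ""
        Return Regex.Split(parts(Index + 1), Regex.Escape(Str2))(0)
    End Function

    ' Copy of the tutorial helper that collects every match.
    Public Function GetBetweenAll(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String) As String()
        Dim results As New List(Of String)
        Dim pieces As New List(Of String)(Regex.Split(Source, Regex.Escape(Str1)))
        pieces.RemoveAt(0) ' Everything before the first Str1 is not a match.
        For Each piece As String In pieces
            results.Add(Regex.Split(piece, Regex.Escape(Str2))(0))
        Next
        Return results.ToArray()
    End Function

    Sub Main()
        ' Hypothetical sample markup, not real Twitter output.
        Dim sample As String = "<li>first</li><li>second</li>"

        ' Prints "first" and then "second".
        For Each item As String In GetBetweenAll(sample, "<li>", "</li>")
            Console.WriteLine(item)
        Next

        ' Index 1 selects the second occurrence of the opening delimiter: prints "second".
        Console.WriteLine(GetBetween(sample, "<li>", "</li>", 1))

        ' The delimiter contains parentheses, which is why it is escaped: prints "42".
        Console.WriteLine(GetBetween("total (approx): 42;", "(approx): ", ";"))
    End Sub
End Module
```

Note that Index counts occurrences of the opening delimiter from zero, so the default of 0 returns the first match.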

Step 3:
In the button's Click event we send a request to the profile page of the username entered in TextBox1 and read the response, which is the page's source code:

  Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
      Dim r As HttpWebRequest = DirectCast(WebRequest.Create("http://www.twitter.com/" & TextBox1.Text), HttpWebRequest)
      Dim src As String
      ' Dispose the response and the reader once the page source has been read.
      Using re As HttpWebResponse = DirectCast(r.GetResponse(), HttpWebResponse)
          Using reader As New System.IO.StreamReader(re.GetResponseStream())
              src = reader.ReadToEnd()
          End Using
      End Using
      If String.IsNullOrEmpty(src) Then
          MsgBox("Error. Src is null")
      Else
          Dim tweets As String() = GetBetweenAll(src, "<li class=""js-stream-item stream-item stream-item expanding-stream-item"" data-item-id=""", "</div></div></li>")
          If (tweets.Length > 0) Then
              Dim tweetcount As Integer = 0
              ' Create the output folder, named after the profile, if it does not exist yet.
              If (Not My.Computer.FileSystem.DirectoryExists(CurDir() & "/" & TextBox1.Text)) Then My.Computer.FileSystem.CreateDirectory(CurDir() & "/" & TextBox1.Text)
              For Each tweet As String In tweets
                  Using sw As New System.IO.StreamWriter(CurDir() & "/" & TextBox1.Text & "/Tweet " & tweetcount & ".txt")
                      tweetcount += 1
                      Dim msg As String = GetBetween(tweet, "<p class=""js-tweet-text tweet-text"">", "</p>")
                      msg = clearTags(msg)
                      sw.Write(msg)
                  End Using
              Next
          End If
      End If
  End Sub

Once we have read the page source, we extract all the loaded tweets using the GetBetweenAll function we added earlier. Then, as long as at least one tweet was found, we iterate through them and write each one to a text file at Current Directory > Profile Username > Tweet *TweetCount*.txt. Before writing each tweet we strip its HTML tags with clearTags, which we create next.
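
The output path described above can also be built with System.IO.Path instead of joining strings with "/". The following is only a sketch under assumptions: the module name TweetPaths and the helpers SafeName and BuildTweetPath are hypothetical and not part of the original project. It additionally strips characters that are not valid in a file name from the username before using it as a folder name.

```vbnet
Imports System
Imports System.IO

Module TweetPaths
    ' Hypothetical helper: removes characters that the current platform
    ' does not allow in a file or folder name.
    Public Function SafeName(ByVal name As String) As String
        For Each bad As Char In Path.GetInvalidFileNameChars()
            name = name.Replace(bad.ToString(), "")
        Next
        Return name.Trim()
    End Function

    ' Hypothetical helper: builds <baseDir>/<user name>/Tweet <index>.txt
    ' using the platform's own directory separator.
    Public Function BuildTweetPath(ByVal baseDir As String, ByVal userName As String, ByVal index As Integer) As String
        Return Path.Combine(baseDir, SafeName(userName), "Tweet " & index.ToString() & ".txt")
    End Function

    Sub Main()
        ' "example/user" is a made-up input; the slash is removed before the folder name is used.
        Console.WriteLine(BuildTweetPath(Directory.GetCurrentDirectory(), "example/user", 0))
    End Sub
End Module
```

The three-argument Path.Combine overload used here needs .NET Framework 4.0 or later.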

Step 4:
Now that we have our tweets, we need to clean them up so we aren't left with things like "&quot;" instead of a quotation mark ("). We are already running "msg" through our clearTags function, so let's create it:

  Private Function clearTags(ByVal s As String) As String
      Dim toreturn As String = s
      If (s.Contains("<") AndAlso s.Contains(">")) Then
          toreturn = ""
          Dim shouldadd As Boolean = True
          ' Copy every character that is not part of a <...> tag.
          For Each c As Char In s
              If (c = "<"c) Then
                  shouldadd = False
              ElseIf (c = ">"c) Then
                  shouldadd = True
              ElseIf (shouldadd) Then
                  toreturn &= c
              End If
          Next
      End If
      ' Replace the HTML entities with the characters they stand for.
      toreturn = toreturn.Replace("&#39;", "'")
      toreturn = toreturn.Replace("&nbsp;", " ")
      toreturn = toreturn.Replace("&quot;", """")
      Return toreturn
  End Function

Note: I might not have caught every replacement, but these are the only ones I could see. If you spot any more, just add more Replace calls to the function above.
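
One possible alternative to listing the replacements by hand is the framework's own decoder, System.Net.WebUtility.HtmlDecode, which handles the standard named and numeric entities in one call. This is a sketch only, assuming the project targets .NET Framework 4.0 or later; the module name EntityDecoding and the function DecodeEntities are hypothetical. HtmlDecode turns "&nbsp;" into a non-breaking space (character code 160), so the sketch converts that to a regular space to match the replacement used above.

```vbnet
Imports System
Imports System.Net

Module EntityDecoding
    ' Hypothetical helper: decodes HTML entities using the framework decoder.
    Public Function DecodeEntities(ByVal s As String) As String
        If s Is Nothing Then Return ""
        Dim decoded As String = WebUtility.HtmlDecode(s)
        ' HtmlDecode produces a non-breaking space for &nbsp;, so swap it for a normal one.
        Return decoded.Replace(ChrW(160), " "c)
    End Function

    Sub Main()
        ' Made-up sample text. Prints: Tom & Jerry's "show"
        Console.WriteLine(DecodeEntities("Tom &amp; Jerry&#39;s &quot;show&quot;"))
    End Sub
End Module
```

If you use something like this, call it on the text after the tags have been stripped, in place of the three Replace calls.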

Project Complete!
Below you will find the complete source code, along with a download of the full project:

  Imports System.Net
  Imports System.Text.RegularExpressions

  Public Class Form1
      ' Returns the text between the (Index + 1)-th occurrence of Str1 and the Str2 that follows it.
      Private Function GetBetween(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String, Optional ByVal Index As Integer = 0) As String
          ' Regex.Escape makes the delimiters match literally instead of as regex patterns.
          Dim parts As String() = Regex.Split(Source, Regex.Escape(Str1))
          ' Return an empty string instead of throwing when Str1 is not found.
          If Index + 1 >= parts.Length Then Return ""
          Return Regex.Split(parts(Index + 1), Regex.Escape(Str2))(0)
      End Function

      ' Returns every piece of text that starts after Str1 and ends before the next Str2.
      Private Function GetBetweenAll(ByVal Source As String, ByVal Str1 As String, ByVal Str2 As String) As String()
          Dim Results As New List(Of String)
          Dim T As New List(Of String)
          T.AddRange(Regex.Split(Source, Regex.Escape(Str1)))
          T.RemoveAt(0) ' Drop everything before the first Str1.
          For Each I As String In T
              Results.Add(Regex.Split(I, Regex.Escape(Str2))(0))
          Next
          Return Results.ToArray()
      End Function

      Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
          Dim r As HttpWebRequest = DirectCast(WebRequest.Create("http://www.twitter.com/" & TextBox1.Text), HttpWebRequest)
          Dim src As String
          ' Dispose the response and the reader once the page source has been read.
          Using re As HttpWebResponse = DirectCast(r.GetResponse(), HttpWebResponse)
              Using reader As New System.IO.StreamReader(re.GetResponseStream())
                  src = reader.ReadToEnd()
              End Using
          End Using
          If String.IsNullOrEmpty(src) Then
              MsgBox("Error. Src is null")
          Else
              Dim tweets As String() = GetBetweenAll(src, "<li class=""js-stream-item stream-item stream-item expanding-stream-item"" data-item-id=""", "</div></div></li>")
              If (tweets.Length > 0) Then
                  Dim tweetcount As Integer = 0
                  ' Create the output folder, named after the profile, if it does not exist yet.
                  If (Not My.Computer.FileSystem.DirectoryExists(CurDir() & "/" & TextBox1.Text)) Then My.Computer.FileSystem.CreateDirectory(CurDir() & "/" & TextBox1.Text)
                  For Each tweet As String In tweets
                      Using sw As New System.IO.StreamWriter(CurDir() & "/" & TextBox1.Text & "/Tweet " & tweetcount & ".txt")
                          tweetcount += 1
                          Dim msg As String = GetBetween(tweet, "<p class=""js-tweet-text tweet-text"">", "</p>")
                          msg = clearTags(msg)
                          sw.Write(msg)
                      End Using
                  Next
              End If
          End If
      End Sub

      Private Function clearTags(ByVal s As String) As String
          Dim toreturn As String = s
          If (s.Contains("<") AndAlso s.Contains(">")) Then
              toreturn = ""
              Dim shouldadd As Boolean = True
              ' Copy every character that is not part of a <...> tag.
              For Each c As Char In s
                  If (c = "<"c) Then
                      shouldadd = False
                  ElseIf (c = ">"c) Then
                      shouldadd = True
                  ElseIf (shouldadd) Then
                      toreturn &= c
                  End If
              Next
          End If
          ' Replace the HTML entities with the characters they stand for.
          toreturn = toreturn.Replace("&#39;", "'")
          toreturn = toreturn.Replace("&nbsp;", " ")
          toreturn = toreturn.Replace("&quot;", """")
          Return toreturn
      End Function
  End Class
