Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)


ASP.NET Triple Whammy


 


I got a phone call from a client last week asking for some help using .NET to grab the contents of a web page full of links. Pulling down the content of a web site is pretty simple using the WebRequest and WebResponse objects. The following code demonstrates how to “scrape” the contents of a web page using VB.NET:


 


'-- Import a couple of namespaces
Imports System.IO
Imports System.Net

'-- And run this code
'-- open the channel to the web site
Dim oReq As WebRequest = _
   System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)


 


 


That was pretty simple. Now for the problems:


 


Problem 1: Attaching to HTTPS site with credentials


The site we were accessing was an HTTPS site that required a login with a username and password. How the heck do you do that? It's actually pretty simple.


 


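If you want to connect to a site that requires authentication, you need to create “credentials” to hand to the site. (As the comments below point out, this requirement comes from the site's authentication, not from HTTPS itself.) You do this by creating a System.Net.NetworkCredential object and attaching it to the request object like so: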



 


'-- Create the credentials and
'-- attach them to the request object
Dim oCred As New _
  System.Net.NetworkCredential(<<USERNAME>>, <<PASSWORD>>)

oReq.Credentials = oCred
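
If you would rather not offer the same login to every server the request might touch, a CredentialCache can tie the credentials to one URL prefix and one authentication scheme. This is only a sketch under assumptions: the module and function names are made up for illustration, and "Basic" assumes the site uses HTTP Basic authentication, which may not match your site.

```vbnet
Imports System
Imports System.Net

'-- Hypothetical helper, not part of the original code
Module CredentialCacheSketch

    '-- Builds a cache that offers the login only to URLs starting
    '-- with siteUrl, and only for the Basic scheme (an assumption)
    Function BuildCredentials(ByVal siteUrl As String, _
                              ByVal userName As String, _
                              ByVal password As String) As CredentialCache
        Dim oCache As New CredentialCache()
        oCache.Add(New Uri(siteUrl), "Basic", _
                   New NetworkCredential(userName, password))
        Return oCache
    End Function

End Module
```

You would then assign the result in place of the single credential, for example oReq.Credentials = BuildCredentials("https://www.example.com/", <<USERNAME>>, <<PASSWORD>>), where the URL is a placeholder.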


 


 


Problem 2: Attaching to an HTTPS site with a bad/invalid certificate


The second problem was that the site we were connecting to had a “questionable” certificate. We received a certificate error when opening the site in a browser.


 


 


This is more common than you would think. So how do you connect to an HTTPS site with a bad certificate? After a little research using Google we found code that discussed overriding the certificate policy by creating a class that implements the ICertificatePolicy interface. The code below demonstrates this class:


 


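Imports System.Net
Imports System.Security.Cryptography.X509Certificates

Public Class CertificateOverride
  Implements ICertificatePolicy

  '-- accept every certificate, whatever the problem
  Public Function CheckValidationResult( _
      ByVal srvPoint As ServicePoint, _
      ByVal certificate As X509Certificate, _
      ByVal request As WebRequest, _
      ByVal certificateProblem As Integer) As Boolean _
    Implements System.Net.ICertificatePolicy.CheckValidationResult

       Return True
  End Function
End Class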


 


This interface declares a single function. When a bad certificate is found, CheckValidationResult is called and the type of error is passed in the certificateProblem parameter. The possible values are as follows:


 

CertEXPIRED                   = 2148204801,
CertVALIDITYPERIODNESTING     = 2148204802,
CertPATHLENCONST              = 2148204804,
CertROLE                      = 2148204803,
CertCRITICAL                  = 2148204805,
CertPURPOSE                   = 2148204806,
CertISSUERCHAINING            = 2148204807,
CertMALFORMED                 = 2148204808,
CertUNTRUSTEDROOT             = 2148204809,
CertCHAINING                  = 2148204810,
CertREVOKED                   = 2148204812,
CertUNTRUSTEDTESTROOT         = 2148204813,
CertREVOCATION_FAILURE        = 2148204814,
CertCN_NO_MATCH               = 2148204815,
CertWRONG_USAGE               = 2148204816,
CertUNTRUSTEDCA               = 2148204818
 

I found these values at the following blog:


http://www.codexchange.net/PreviewSnippet.aspx?SnippetID=d40708fc-4041-42b8-9016-f0ac96d14fce


 


We trusted the site we were connecting to, so we return True from this function regardless of the problem.
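
If you only want to tolerate one specific problem rather than all of them, the same interface can check the problem code before answering. The sketch below is an illustration, not what we shipped: the class name and the IsAcceptable helper are made up, it assumes the problem code is 0 when the certificate validated cleanly, and it picks CertUNTRUSTEDROOT (2148204809 in the list above) purely as an example.

```vbnet
Imports System
Imports System.Net
Imports System.Security.Cryptography.X509Certificates

'-- Hypothetical variant of CertificateOverride for the 1.1 Framework
Public Class SelectiveCertificatePolicy
    Implements ICertificatePolicy

    '-- 2148204809 (CertUNTRUSTEDROOT) written as a signed 32-bit value,
    '-- because certificateProblem arrives as an Integer
    Private Const CertUntrustedRoot As Integer = &H800B0109

    Public Function CheckValidationResult( _
        ByVal srvPoint As ServicePoint, _
        ByVal certificate As X509Certificate, _
        ByVal request As WebRequest, _
        ByVal certificateProblem As Integer) As Boolean _
      Implements ICertificatePolicy.CheckValidationResult

        Return IsAcceptable(certificateProblem)
    End Function

    '-- True for "no problem" (assumed to be 0) or for the one
    '-- problem we decided to tolerate; False for everything else
    Public Shared Function IsAcceptable( _
        ByVal certificateProblem As Integer) As Boolean

        Return certificateProblem = 0 _
            OrElse certificateProblem = CertUntrustedRoot
    End Function
End Class
```

Swap the constant for whichever value from the list matches the error your site actually produces.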


 


After creating this class we needed to override the certificate management in our code. So we added this code to the top of the function:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

 

So now the complete example looks like this:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

'-- open the channel to the web site
Dim oReq As WebRequest = _
   System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for the site login
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)


 


So now we have a comprehensive example of connecting to sites with bad (or good) certificates and using credentials. One good thing is that the .NET Framework was capable of doing this every step of the way!


NOTE: In a comment someone pointed out that the CertificatePolicy property I used is obsolete in the 2.0 Framework (my client still uses VS 2003 and the 1.1 Framework for their code). I did a little digging and found that this is now done via a callback function. The code and class for this are below:


Imports System.IO
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates

Public Class CertificateOverride

    '-- accept every certificate, whatever the policy errors
    Public Function RemoteCertificateValidationCallback( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal sslPolicyErrors As SslPolicyErrors _
      ) As Boolean

        Return True
    End Function
End Class

'-- And in the calling code
Dim oCertOverride As New CertificateOverride

'-- override the bad certificate error
ServicePointManager.ServerCertificateValidationCallback = _
    AddressOf oCertOverride.RemoteCertificateValidationCallback

'-- open the channel to the web site
Dim oReq As WebRequest = _
    System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for the site login
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)
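
The callback above returns True for everything. If you know which certificate the site should present, a narrower version can accept clean certificates as usual and otherwise accept only that one certificate. Treat this as a rough sketch: the class name, the IsTrusted helper and the ExpectedHash placeholder are all made up for illustration, and you would need to paste in the hash of your own site's certificate.

```vbnet
Imports System
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates

'-- Hypothetical alternative to CertificateOverride for the 2.0 Framework
Public Class PinnedCertificateOverride

    '-- Placeholder: hex hash of the one certificate you expect,
    '-- as returned by X509Certificate.GetCertHashString()
    Private Const ExpectedHash As String = "<<CERTIFICATE HASH>>"

    Public Function RemoteCertificateValidationCallback( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal policyErrors As SslPolicyErrors) As Boolean

        Dim cHash As String = Nothing
        If Not certificate Is Nothing Then
            cHash = certificate.GetCertHashString()
        End If
        Return IsTrusted(policyErrors, cHash)
    End Function

    '-- True when validation found no errors, or when the certificate
    '-- presented is exactly the one we expect; False otherwise
    Public Shared Function IsTrusted( _
        ByVal policyErrors As SslPolicyErrors, _
        ByVal certHash As String) As Boolean

        If policyErrors = System.Net.Security.SslPolicyErrors.None Then
            Return True
        End If
        If certHash Is Nothing Then Return False
        Return String.Equals(certHash, ExpectedHash, _
                             StringComparison.OrdinalIgnoreCase)
    End Function
End Class
```

It hooks up the same way as before, by creating an instance and pointing ServicePointManager.ServerCertificateValidationCallback at its RemoteCertificateValidationCallback method with AddressOf.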


24 Responses to Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)

  1. chili says:

    Thanks Rod, but I have a problem. I checked the source and I received status code 200, but only this. I expected to receive text like CORRECT or INCORRECT, but I receive nothing. Would you help me?

    Thanks in advance.

  3. Thanks Rod! This code bailed me out big time. I needed to post variables to the page so I added the following before oResp was defined: (example)

    Dim data As String = "action=updateAll&maxPrice=590&sPrice=0&startingPos=0&nbrRecs=60"

    Dim writer As New StreamWriter(oReq.GetRequestStream)
    writer.Write(data)
    writer.Close()

  4. Jason says:

    Rod,

    Thanks for the code…but where do I put it in my project? Can this be used in the webbrowser control?

    Thanks!

  5. ferit says:

    pls help to login
    secure.lme.com/…/Dataprices_daily_metals.aspx
    can you write a code which working true.

    Thanks.

  6. ferit says:

    Manic is right , i want to connect but i did not it. pls help
    Thanks

  7. Jon says:

    Great post! Thanks

  8. Hank says:

    In problem 1 there is mention of a UserName and Password. The site I’m trying to scrape also has a username and password to get into the site. Is this the same? Or are those elements used for the certificate only?

  9. Neil says:

    Worked a treat. Thanks :) I saw a lot of C# examples, but this was the first VB example I saw.

  10. someone says:

    Thank you very much !

    This is exactly what I was looking for.
    Great help.

  11. Mani says:

    Thanks for the great code. I have a question, if the site require authentication (form authentication) we need to do it through post and then store the cookie to work further.

    The problem I am facing is when I attach the cookie with the second page after successful login, it failed to retrieve any data due to SSL failure. Can you help me on that.

  12. Piyush says:

    hi sir
    i need to login programattically and scrapping data from a website which is the next page after login .
    for this i m using html agility pack.
    i m using the link http://www.dotnetjunkies.com/WebLog/joshuagough/archive/2006/01/20/134825.aspx
    as reference .
    for trial i m trying to login in gmail and code is as following……………………
    using System;
    using System.Data;
    using System.Configuration;
    using System.Web;
    using System.Web.Security;
    using System.Web.UI;
    using System.Web.UI.WebControls;
    using System.Web.UI.WebControls.WebParts;
    using System.Web.UI.HtmlControls;

    public partial class _Default : System.Web.UI.Page
    {
    protected void Page_Load(object sender, EventArgs e)
    {
    FormProcessor p = new FormProcessor();
    string userName = "*****************";
    string password = "******************";
    Form form = p.GetForm("https://Gmail.com", "//form[@name='loginForm']", FormQueryModeEnum.Nested);
    form["j_username"].SetAttributeValue("value", userName);
    form["j_password"].SetAttributeValue("value", password);
    HtmlDocument doc = p.SubmitForm(form);
    string strBal = doc.DocumentNode.SelectSingleNode
    ("//span[@class='redText']").InnerText;
    strBal = System.Web.HttpUtility.HtmlDecode(strBal);
    strBal = strBal.Substring(1).Trim();
    }
    }

    in which i m facing problem in xpath //form[@name='loginForm'] the error is node not found.
    i want to know that how can i compose the xpath for any website . plz tell me complete reference about it.

    thanks in advance

    sharad soni

  13. Mark says:

    Love it Thanks

  14. Tom says:

    Very useful and simple – thanx !

  15. Jun Meng says:

    DisonWorld: this article is suitable for a web application calling another web service, not for a web browser accessing a web application.

  16. bamboowave says:

    useful posting.

  17. DisonWorld says:

    If the site using the https, then when a user visits the site, a popup same as the picture in this article will show, could anyone tell me how to remove it?

    That is, is there a way to put any codes in the aspx, then the popup will never show again?

    I have tried to put the following codes [C# 2.0] in the global.asax [Application_AuthenticateRequest]
    System.Net.ServicePointManager.ServerCertificateValidationCallback += delegate(object objSender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
    {
    return true;
    };

    But the popup still shows after the codes run.

  18. Matt says:

    Super useful code here – I can now make my utility to go and grab my PIX firewall configurations on a regular basis with ease! Sweet & Thank you!

  19. rodpaddock says:

    Good point joshua. The HTTPS didn’t require authentication it was the fact that the web site did. The HTTPS issue was the bad certificate.

  20. Do the credentials really have anything to do with the fact that the site uses https? Isn’t it just because the site requires authentication (windows, or forms)? You would have to supply the credentials, regardless of the protocol. Similarly, if the site was available to anonymous users, I don’t think you would have to supply credentials – even if it was over https. Protecting access (authentication) is different than encrypting the traffic on the wire (https).

  21. gappleby says:

    Hey Rod

    Good to see there’s a VB guy here now that I’ve left codebetter :)

    One thing to note however is that the ServicePointManager.CertificatePolicy property is now marked as obsolete in the 2.0 framework (the code is completely valid in 1.1, and still works in 2.0).

    The warning says to use ServerCertificateValidationCallback instead.

    I discovered this when I upgraded some old code a couple of months ago – but I haven’t had time to figure out how this new one works yet :)

  22. Adarsh Bhat says:

    That’s useful information. Thanks.
