CodeBetter.Com

Rod Paddock


Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)


 

I got a phone call from a client last week asking for some help using .NET to grab the contents of a web page full of links. Pulling down the content of a web page is pretty simple using the WebRequest and WebResponse objects. The following code demonstrates how to "scrape" the contents of a web page using VB.NET:

 

'-- Import a couple of namespaces
Imports System.IO
Imports System.Net

'-- And run this code

'-- open the channel to the web site
Dim oReq As WebRequest = _
   WebRequest.Create("http://www.dashpoint.com")

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream())

'-- get the content
Dim cContent As String = oSRead.ReadToEnd()

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)

 

 

That was pretty simple. Now for the problems:

 

Problem 1: Attaching to an HTTPS site with credentials

The site we were accessing was an HTTPS site that required us to log in with a username and password. How the heck do you do that? It's actually pretty simple.

If a site requires authentication, you need to create credentials to hand to it. You do this by creating a System.Net.NetworkCredential object and attaching it to the request object, like so:

 

'-- Create the credentials for the site and
'-- attach them to the request object
Dim oCred As New _
  System.Net.NetworkCredential(<<USERNAME>>, <<PASSWORD>>)

oReq.Credentials = oCred
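
If you want the credentials handed only to the site that needs them, one option is a CredentialCache, which ties a credential to a URL prefix and an authentication scheme. This is a minimal sketch, not the code we used: the ScopedCredentials class name, the www.example.com prefix and the "Basic" scheme are all assumptions you would swap for your own site's details.

```vbnet
Imports System
Imports System.Net

'-- Hypothetical helper, for illustration only
Public Class ScopedCredentials

    '-- Build a cache holding one credential that applies only to
    '-- URLs starting with cPrefix and only for the "Basic" scheme
    '-- (the scheme is an assumption about the target site)
    Public Shared Function Build( _
        ByVal cPrefix As String, _
        ByVal cUser As String, _
        ByVal cPassword As String) As CredentialCache

        Dim oCache As New CredentialCache()
        oCache.Add(New Uri(cPrefix), "Basic", _
            New NetworkCredential(cUser, cPassword))
        Return oCache
    End Function

End Class

'-- Usage, inside the function that makes the request:
'--   oReq.Credentials = ScopedCredentials.Build( _
'--       "https://www.example.com/", <<USERNAME>>, <<PASSWORD>>)
```

A request to any URL outside that prefix gets no credential from the cache, so a redirect to another host does not receive the username and password.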

 

 

Problem 2: Attaching to an HTTPS site with a bad/invalid certificate

The second problem was that the site we were connecting to had a "questionable" certificate. We received an error when trying to open the site in a browser, like so:

[Screenshot: browser certificate warning]

This is more common than you would think. So how do you connect to an HTTPS site with a bad certificate? After a little research using Google we found code that discussed overriding the certificate policy by creating a class that implements the ICertificatePolicy interface. The code below demonstrates this class:

 

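Imports System.Net
Imports System.Security.Cryptography.X509Certificates

Public Class CertificateOverride
  Implements ICertificatePolicy

  '-- Called when a server certificate is checked; True accepts it
  Public Function CheckValidationResult( _
      ByVal srvPoint As ServicePoint, _
      ByVal certificate As X509Certificate, _
      ByVal request As WebRequest, _
      ByVal certificateProblem As Integer) As Boolean _
    Implements System.Net.ICertificatePolicy.CheckValidationResult

       '-- accept the certificate regardless of the problem reported
       Return True
  End Function
End Class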

 

This interface defines a single function. When a problem is found with a certificate, CheckValidationResult is called and the type of problem is passed in the certificateProblem parameter. The possible values are as follows:

 

CertEXPIRED                   = 2148204801,
CertVALIDITYPERIODNESTING     = 2148204802,
CertPATHLENCONST              = 2148204804,
CertROLE                      = 2148204803,
CertCRITICAL                  = 2148204805,
CertPURPOSE                   = 2148204806,
CertISSUERCHAINING            = 2148204807,
CertMALFORMED                 = 2148204808,
CertUNTRUSTEDROOT             = 2148204809,
CertCHAINING                  = 2148204810,
CertREVOKED                   = 2148204812,
CertUNTRUSTEDTESTROOT         = 2148204813,
CertREVOCATION_FAILURE        = 2148204814,
CertCN_NO_MATCH               = 2148204815,
CertWRONG_USAGE               = 2148204816,
CertUNTRUSTEDCA               = 2148204818
 

I found these values at the following blog:

http://www.codexchange.net/PreviewSnippet.aspx?SnippetID=d40708fc-4041-42b8-9016-f0ac96d14fce

 

We trusted the site we were connecting to, so this function returns True regardless of the problem.
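
Returning True for everything is the bluntest option. If you only want to tolerate one specific problem, the policy can check certificateProblem against the list above. A rough sketch follows; the SelectiveCertificatePolicy name and its helper functions are hypothetical, the choice of CertUNTRUSTEDROOT is only an example, and it assumes a value of 0 means no problem was reported.

```vbnet
Imports System.Net
Imports System.Security.Cryptography.X509Certificates

'-- Hypothetical policy: tolerate an untrusted root, reject the rest
Public Class SelectiveCertificatePolicy
    Implements ICertificatePolicy

    '-- value taken from the list above
    Private Const CertUNTRUSTEDROOT As Long = 2148204809L

    '-- certificateProblem arrives as a signed Integer, so map it
    '-- back to the unsigned value used in the list
    Public Shared Function ToUnsigned(ByVal nProblem As Integer) As Long
        If nProblem < 0 Then
            Return CLng(nProblem) + 4294967296L
        End If
        Return CLng(nProblem)
    End Function

    Public Shared Function IsAcceptable(ByVal nProblem As Integer) As Boolean
        '-- assumption: 0 means no problem was reported
        If nProblem = 0 Then
            Return True
        End If
        Return ToUnsigned(nProblem) = CertUNTRUSTEDROOT
    End Function

    Public Function CheckValidationResult( _
        ByVal srvPoint As ServicePoint, _
        ByVal certificate As X509Certificate, _
        ByVal request As WebRequest, _
        ByVal certificateProblem As Integer) As Boolean _
        Implements ICertificatePolicy.CheckValidationResult

        Return IsAcceptable(certificateProblem)
    End Function
End Class
```

With this in place an expired or revoked certificate still fails the request, while a certificate whose only problem is an untrusted root gets through.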

 

After creating this class we needed to override the certificate management in our code. So we added this code to the top of the function:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

 

So now the complete example looks like this:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

'-- open the channel to the web site
Dim oReq As WebRequest = _
   WebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for the site
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream())

'-- get the content
Dim cContent As String = oSRead.ReadToEnd()

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)

 

So now we have a complete example of connecting to sites with bad (or good) certificates and using credentials. One good thing is that the .NET Framework was capable of doing this every step of the way!

NOTE: In a comment someone pointed out that the ICertificatePolicy interface I used is obsolete in the 2.0 Framework (my client still uses VS 2003 and the 1.1 Framework for their code). I did a little digging and found that this is now done via a callback function. The code and class for this are below:

Imports System.IO
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates

Public Class CertificateOverride

    '-- Matches the RemoteCertificateValidationCallback delegate;
    '-- returning True accepts the certificate whatever the errors
    Public Function RemoteCertificateValidationCallback( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal sslPolicyErrors As SslPolicyErrors _
           ) As Boolean

        Return True

    End Function
End Class

Dim oCertOverride As New CertificateOverride

'-- override the bad certificate error
ServicePointManager.ServerCertificateValidationCallback = _
    AddressOf oCertOverride.RemoteCertificateValidationCallback

'-- open the channel to the web site
Dim oReq As WebRequest = _
    WebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for the site
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream())

'-- get the content
Dim cContent As String = oSRead.ReadToEnd()

'-- release the reader and the connection
oSRead.Close()
oResp.Close()

MessageBox.Show(cContent)
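
The callback can also be narrower than a blanket Return True. The sketch below accepts a certificate when there are no policy errors, or when its hash matches one you have recorded ahead of time for the site you trust. The PinnedCertificateCheck class, its member names and the <<CERT HASH>> placeholder are hypothetical; treat it as a starting point under those assumptions.

```vbnet
Imports System
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates

'-- Hypothetical helper, for illustration only
Public Class PinnedCertificateCheck

    '-- placeholder: the hash string of the one certificate we trust
    Public Shared PinnedHash As String = "<<CERT HASH>>"

    '-- the decision itself, kept separate so it is easy to test
    Public Shared Function ShouldAccept( _
        ByVal errors As SslPolicyErrors, _
        ByVal cCertHash As String, _
        ByVal cPinned As String) As Boolean

        '-- a certificate with no policy errors is always fine
        If errors = SslPolicyErrors.None Then
            Return True
        End If

        '-- otherwise accept only the certificate we recorded
        Return String.Equals(cCertHash, cPinned, _
            StringComparison.OrdinalIgnoreCase)
    End Function

    '-- matches the RemoteCertificateValidationCallback delegate
    Public Shared Function Validate( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal errors As SslPolicyErrors) As Boolean

        If certificate Is Nothing Then
            Return False
        End If
        Return ShouldAccept(errors, _
            certificate.GetCertHashString(), PinnedHash)
    End Function

    '-- call once before making any requests
    Public Shared Sub Install()
        ServicePointManager.ServerCertificateValidationCallback = _
            AddressOf Validate
    End Sub

End Class
```

Calling PinnedCertificateCheck.Install() takes the place of the AddressOf assignment in the example above, and any other certificate with errors is still rejected.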



Comments

Sahil Malik said:

Good post.
# May 6, 2006 11:03 PM

Adarsh Bhat said:

That's useful information. Thanks.
# May 7, 2006 1:04 AM

Geoff Appleby said:

Hey Rod

Good to see there's a VB guy here now that I've left codebetter :)

One thing to note however is that the ServicePointManager.CertificatePolicy property is now marked as obsolete in the 2.0 framework (the code is completely valid in 1.1, and still works in 2.0).

The warning says to use ServerCertificateValidationCallback instead.

I discovered this when I upgraded some old code a couple of months ago - but I haven't had time to figure out how this new one works yet :)
# May 7, 2006 4:31 AM

Joshua Flanagan said:

Do the credentials really have anything to do with the fact that the site uses https? Isn't it just because the site requires authentication (windows, or forms)? You would have to supply the credentials, regardless of the protocol. Similarly, if the site was available to anonymous users, I don't think you would have to supply credentials - even if it was over https. Protecting access (authentication) is different than encrypting the traffic on the wire (https).
# May 7, 2006 9:23 PM

Rod Paddock [MVP] said:

Good point, Joshua. HTTPS itself didn't require authentication; the web site did. The HTTPS issue was the bad certificate.

# May 7, 2006 11:30 PM

Matt said:

Super useful code here - I can now make my utility to go and grab my PIX firewall configurations on a regular basis with ease! Sweet & Thank you!
# August 3, 2006 1:24 PM

DisonWorld said:

If the site using the https, then when a user visits the site, a popup same as the picture in this article will show, could anyone tell me how to remove it?

That is, is there a way to put any codes in the aspx, then the popup will never show again?

I have tried to put the following codes [C# 2.0] in the global.asax [Application_AuthenticateRequest]
System.Net.ServicePointManager.ServerCertificateValidationCallback += delegate(object objSender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
{
       return true;
};

But the popup still shows after the codes run.
# August 11, 2006 12:19 AM

bamboowave said:

useful posting.
# August 18, 2006 1:44 PM

Jun Meng said:

DisonWorld: this article is suitable for a web application calling another web service, not for a web browser accessing a web application.
# August 22, 2006 4:46 PM

Tom said:

Very useful and simple - thanx !

# September 21, 2006 8:23 AM

Mark said:

Love it Thanks

# May 1, 2007 12:24 AM

Piyush said:

hi sir

i need to login programattically and scrapping data from a website which is the next page after login .

for this i m using html agility pack.

i m using the link www.dotnetjunkies.com/.../134825.aspx

as reference .

for trial i m trying to login in gmail and code is as following........................

using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        FormProcessor p = new FormProcessor();
        string userName = "*****************";
        string password = "******************";

        Form form = p.GetForm("https://Gmail.com", "//form[@name='loginForm']", FormQueryModeEnum.Nested);
        form["j_username"].SetAttributeValue("value", userName);
        form["j_password"].SetAttributeValue("value", password);

        HtmlDocument doc = p.SubmitForm(form);
        string strBal = doc.DocumentNode.SelectSingleNode("//span[@class='redText']").InnerText;
        strBal = System.Web.HttpUtility.HtmlDecode(strBal);
        strBal = strBal.Substring(1).Trim();
    }
}

in which i m facing problem in xpath //form[@name='loginForm'] the error is node not found.

i want to know that how can i compose the xpath for any website . plz tell me complete reference about it.

thanks in advance

sharad soni

# July 30, 2007 2:31 AM

Mani said:

Thanks for the great code. I have a question, if the site require authentication (form authentication) we need to do it through post and then store the cookie to work further.

The problem I am facing is when I attach the cookie with the second page after successful login, it failed to retrieve any data due to SSL failure. Can you help me on that.

# October 8, 2007 11:46 PM

someone said:

Thank you very much !

This is exactly what I was looking for.

Great help.

# February 14, 2008 2:41 PM

Neil said:

Worked a treat.  Thanks :) I saw a lot of C# examples, but this was the first VB example I saw.

# March 20, 2008 1:37 AM

Hank said:

In problem 1 there is mention of a UserName and Password. The site I'm trying to scrape also has a Username and password to get into the site. Is this the same? Or is the elements used for the certificate only?

# June 17, 2008 10:51 AM

Jon said:

Great post! Thanks

# July 21, 2008 4:46 AM
