Rod Paddock

Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)


 

I got a phone call from a client last week asking for some help using .NET to grab the contents of a web page full of links. Pulling down the content of a web page is pretty simple using the WebRequest and WebResponse objects. The following code demonstrates how to "scrape" the contents of a web page using .NET:

 

'-- Import a couple of namespaces
Imports System.IO
Imports System.Net

'-- And run this code
'-- open the channel to the web site
Dim oReq As WebRequest = _
   System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

MessageBox.Show(cContent)

 

 

That was pretty simple. Now for the problems:

 

Problem 1: Attaching to an HTTPS site with credentials

The site we were accessing was an HTTPS site that also required us to log in with a username and password. How the heck do you do that? It's actually pretty simple.

If you want to connect to a site that requires authentication, you need to create credentials to hand to the site. The credentials are needed because the site requires a login, not because it uses HTTPS. You do this by creating a System.Net.NetworkCredential object and attaching it to the request object, like so:

 

'-- Create the credentials for HTTPS and
'-- attach them to the request object
Dim oCred As New _
  System.Net.NetworkCredential(<<USERNAME>>, <<PASSWORD>>)

oReq.Credentials = oCred
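
As a rough sketch of how the credentials relate to the authentication scheme rather than to HTTPS: a NetworkCredential on a WebRequest answers HTTP-level login challenges, and a CredentialCache lets you state which URL prefix and which scheme a username and password are meant for. The URL, the "Basic" scheme, the user name, the password, and the module name below are all placeholders chosen for illustration, not details of the client's site, so the request only succeeds once they are swapped for real values. The Using blocks assume VB 2005 or later.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module CredentialSketch
    Sub Main()
        '-- hypothetical URL and login, for illustration only
        Dim oUri As New Uri("https://www.example.com/reports/")
        Dim oCred As New NetworkCredential("someuser", "somepassword")

        '-- tie the credentials to one URL prefix and one auth scheme
        Dim oCache As New CredentialCache()
        oCache.Add(oUri, "Basic", oCred)

        '-- hand the cache to the request instead of a bare credential
        Dim oReq As WebRequest = WebRequest.Create(oUri)
        oReq.Credentials = oCache

        '-- Using blocks close the response and the reader when done
        Using oResp As WebResponse = oReq.GetResponse()
            Using oSRead As New StreamReader(oResp.GetResponseStream())
                Console.WriteLine(oSRead.ReadToEnd())
            End Using
        End Using
    End Sub
End Module
```

This only covers sites that challenge for credentials at the HTTP level. A site that logs you in through an HTML form needs a POST and cookies instead, which this sketch does not attempt.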

 

 

Problem 2: Attaching to an HTTPS site with a bad/invalid certificate

The second problem was that the site we were connecting to had a "questionable" certificate. We received a certificate error when trying to open the site in a browser, like so:

[Image missing: the browser's certificate warning]

This is more common than you would think. So how do you connect to an HTTPS site with a bad certificate? After a little research using Google, we found code that overrides the certificate policy with a class that implements the ICertificatePolicy interface. The code below demonstrates this class:

 

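Imports System.Net
Imports System.Security.Cryptography.X509Certificates

Public Class CertificateOverride
  Implements ICertificatePolicy

  '-- accept every certificate, whatever problem is reported
  Public Function CheckValidationResult( _
      ByVal srvPoint As ServicePoint, _
      ByVal certificate As X509Certificate, _
      ByVal request As WebRequest, _
      ByVal certificateProblem As Integer) As Boolean _
    Implements System.Net.ICertificatePolicy.CheckValidationResult

       Return True
  End Function
End Class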

 

This interface defines a single function. When a certificate fails validation, the framework calls CheckValidationResult and passes the type of error found in the certificateProblem parameter. The possible values are as follows:

 

CertEXPIRED                   = 2148204801,
CertVALIDITYPERIODNESTING     = 2148204802,
CertPATHLENCONST              = 2148204804,
CertROLE                      = 2148204803,
CertCRITICAL                  = 2148204805,
CertPURPOSE                   = 2148204806,
CertISSUERCHAINING            = 2148204807,
CertMALFORMED                 = 2148204808,
CertUNTRUSTEDROOT             = 2148204809,
CertCHAINING                  = 2148204810,
CertREVOKED                   = 2148204812,
CertUNTRUSTEDTESTROOT         = 2148204813,
CertREVOCATION_FAILURE        = 2148204814,
CertCN_NO_MATCH               = 2148204815,
CertWRONG_USAGE               = 2148204816,
CertUNTRUSTEDCA               = 2148204818
 

I found these values at the following blog:

http://www.codexchange.net/PreviewSnippet.aspx?SnippetID=d40708fc-4041-42b8-9016-f0ac96d14fce

 

We trusted the site we were connecting to, so we returned True from this function regardless of the problem.
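
If you would rather not accept every problem, the same interface can be used more selectively. The following is a minimal sketch, not the code we used: the class name and the choice of which problems to tolerate are assumptions made for illustration. It takes two values from the list above and writes them in hexadecimal, because certificateProblem is declared as a signed Integer and the decimal values in the list do not fit in one. It also assumes that the framework passes 0 when no problem was found.

```vbnet
Imports System.Net
Imports System.Security.Cryptography.X509Certificates

'-- hypothetical class name, for illustration only
Public Class SelectiveCertificatePolicy
    Implements ICertificatePolicy

    '-- 2148204801 and 2148204815 from the list, written as hex so
    '-- they become the matching (negative) signed Integer values
    Private Const CertEXPIRED As Integer = &H800B0101
    Private Const CertCN_NO_MATCH As Integer = &H800B010F

    Public Function CheckValidationResult( _
        ByVal srvPoint As ServicePoint, _
        ByVal certificate As X509Certificate, _
        ByVal request As WebRequest, _
        ByVal certificateProblem As Integer) As Boolean _
        Implements ICertificatePolicy.CheckValidationResult

        Select Case certificateProblem
            Case 0
                Return True    '-- assumed to mean "no problem found"
            Case CertEXPIRED, CertCN_NO_MATCH
                Return True    '-- problems we chose to tolerate
            Case Else
                Return False   '-- reject everything else
        End Select
    End Function
End Class
```

It is installed the same way as the class above, by assigning a new instance to ServicePointManager.CertificatePolicy.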

 

After creating this class, we needed to override the certificate policy in our code, so we added this code to the top of the function:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

 

So now the complete example looks like this:

'-- override the bad certificate error
ServicePointManager.CertificatePolicy = New CertificateOverride

'-- open the channel to the web site
Dim oReq As WebRequest = _
   System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for HTTPS
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

MessageBox.Show(cContent)

 

So now we have a comprehensive example of connecting to sites with bad (or good) certificates and using credentials. One good thing is that the .NET Framework was capable of doing this every step of the way!

NOTE: In a comment, someone pointed out that the CertificatePolicy property I used is obsolete in the 2.0 Framework (my client still uses VS 2003 and the 1.1 Framework for their code). I did a little digging and found that this is now done via a callback function assigned to ServicePointManager.ServerCertificateValidationCallback. The code and class for this are below:

Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates


Public Class CertificateOverride

    Public Function RemoteCertificateValidationCallback( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal sslPolicyErrors As SslPolicyErrors _
        ) As Boolean

        Return True

    End Function
End Class
 

 

Dim oCertOverride As New CertificateOverride

'-- override the bad certificate error
ServicePointManager.ServerCertificateValidationCallback = _
    AddressOf oCertOverride.RemoteCertificateValidationCallback

'-- open the channel to the web site
Dim oReq As WebRequest = _
    System.Net.HttpWebRequest.Create("http://www.dashpoint.com")

'-- set the credentials for HTTPS
Dim oCred As New System.Net.NetworkCredential("", "")
oReq.Credentials = oCred

'-- get a response from the site
Dim oResp As WebResponse = oReq.GetResponse()

'-- attach the stream to a reader
Dim oSRead As New StreamReader(oResp.GetResponseStream)

'-- get the content
Dim cContent As String = oSRead.ReadToEnd

MessageBox.Show(cContent)
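
The 2.0 callback also receives the validation errors, so it does not have to accept everything. Here is a hedged sketch of a narrower version: it passes certificates that validated normally and tolerates errors only for one named host. The class name, the function name, and the host are placeholders invented for illustration, and the code assumes that sender is the HttpWebRequest being made, rejecting the certificate when it is not.

```vbnet
Imports System
Imports System.Net
Imports System.Net.Security
Imports System.Security.Cryptography.X509Certificates

'-- hypothetical class name, for illustration only
Public Class SelectiveCertificateOverride

    '-- hypothetical host whose certificate we have decided to trust
    Private Const TrustedHost As String = "www.example.com"

    Public Function ValidateCertificate( _
        ByVal sender As Object, _
        ByVal certificate As X509Certificate, _
        ByVal chain As X509Chain, _
        ByVal errors As SslPolicyErrors) As Boolean

        '-- the certificate passed normal validation
        If errors = SslPolicyErrors.None Then Return True

        '-- assumption: sender is the request; reject if it is not
        Dim oReq As HttpWebRequest = TryCast(sender, HttpWebRequest)
        If oReq Is Nothing Then Return False

        '-- tolerate certificate errors for the trusted host only
        Return String.Equals(oReq.RequestUri.Host, TrustedHost, _
                             StringComparison.OrdinalIgnoreCase)
    End Function
End Class
```

To use it, create an instance and assign AddressOf its ValidateCertificate function to ServicePointManager.ServerCertificateValidationCallback, in the same way as the CertificateOverride class above.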


Posted 05-06-2006 10:50 AM by Rod Paddock [MVP]


Comments

Sahil Malik wrote re: Scraping Secure HTTPS Sites
on 05-06-2006 11:03 PM
Good post.
Adarsh Bhat wrote re: Scraping Secure HTTPS Sites
on 05-07-2006 1:04 AM
That's useful information. Thanks.
Geoff Appleby wrote re: Scraping Secure HTTPS Sites
on 05-07-2006 4:31 AM
Hey Rod

Good to see there's a VB guy here now that I've left codebetter :)

One thing to note however is that the ServicePointManager.CertificatePolicy property is now marked as obsolete in the 2.0 framework (the code is completely valid in 1.1, and still works in 2.0).

The warning says to use ServerCertificateValidationCallback instead.

I discovered this when I upgraded some old code a couple of months ago - but I haven't had time to figure out how this new one works yet :)
Jason Haley wrote Interesting Finds
on 05-07-2006 12:48 PM
Joshua Flanagan wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 05-07-2006 9:23 PM
Do the credentials really have anything to do with the fact that the site uses https? Isn't it just because the site requires authentication (windows, or forms)? You would have to supply the credentials, regardless of the protocol. Similarly, if the site was available to anonymous users, I don't think you would have to supply credentials - even if it was over https. Protecting access (authentication) is different than encrypting the traffic on the wire (https).
Rod Paddock [MVP] wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 05-07-2006 11:30 PM
Good point, Joshua. HTTPS didn't require authentication; it was the web site that did. The HTTPS issue was the bad certificate.

Matt wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 08-03-2006 1:24 PM
Super useful code here - I can now make my utility to go and grab my PIX firewall configurations on a regular basis with ease! Sweet & Thank you!
DisonWorld wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 08-11-2006 12:19 AM
If the site using the https, then when a user visits the site, a popup same as the picture in this article will show, could anyone tell me how to remove it?

That is, is there a way to put any codes in the aspx, then the popup will never show again?

I have tried to put the following codes [C# 2.0] in the global.asax [Application_AuthenticateRequest]
System.Net.ServicePointManager.ServerCertificateValidationCallback += delegate(object objSender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
{
       return true;
};

But the popup still shows after the codes run.
bamboowave wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 08-18-2006 1:44 PM
useful posting.
Jun Meng wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 08-22-2006 4:46 PM
DisonWorld: this article is suitable for a web application calling another web service, not for a web browser accessing a web application.
Tom wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 09-21-2006 8:23 AM

Very useful and simple - thanx !

Mark wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 05-01-2007 12:24 AM

Love it Thanks

Piyush wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 07-30-2007 2:31 AM

hi sir

i need to login programattically and scrapping data from a website which is the next page after login .

for this i m using html agility pack.

i m using the link www.dotnetjunkies.com/.../134825.aspx

as reference .

for trial i m trying to login in gmail and code is as following........................

using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        FormProcessor p = new FormProcessor();
        string userName = "*****************";
        string password = "******************";
        Form form = p.GetForm("https://Gmail.com", "//form[@name='loginForm']", FormQueryModeEnum.Nested);
        form["j_username"].SetAttributeValue("value", userName);
        form["j_password"].SetAttributeValue("value", password);
        HtmlDocument doc = p.SubmitForm(form);
        string strBal = doc.DocumentNode.SelectSingleNode
            ("//span[@class='redText']").InnerText;
        strBal = System.Web.HttpUtility.HtmlDecode(strBal);
        strBal = strBal.Substring(1).Trim();
    }
}

in which i m facing problem in xpath //form[@name='loginForm'] the error is node not found.

i want to know that how can i compose the xpath for any website . plz tell me complete reference about it.

thanks in advance

sharad soni

Mani wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 10-08-2007 11:46 PM

Thanks for the great code. I have a question, if the site require authentication (form authentication) we need to do it through post and then store the cookie to work further.

The problem I am facing is when I attach the cookie with the second page after successful login, it failed to retrieve any data due to SSL failure. Can you help me on that.

someone wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 02-14-2008 2:41 PM

Thank you very much !

This is exactly what I was looking for.

Great help.

Neil wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 03-20-2008 1:37 AM

Worked a treat.  Thanks :) I saw a lot of C# examples, but this was the first VB example I saw.

Hank wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 06-17-2008 10:51 AM

In problem 1 there is mention of a UserName and Password. The site I'm trying to scrape also has a Username and password to get into the site. Is this the same? Or is the elements used for the certificate only?

Jon wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 07-21-2008 4:46 AM

Great post! Thanks

Manic wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 12-28-2008 3:43 PM

This is not working for me. Check this page:

secure.lme.com/.../Dataprices_daily_metals.aspx

ferit wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 02-23-2009 11:31 AM

Manic is right , i want to connect but i did not it. pls help

Thanks

ferit wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 02-24-2009 7:36 AM

pls help to login  

secure.lme.com/.../Dataprices_daily_metals.aspx

can you write a code which working true.

Thanks.

Jason wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 02-25-2009 1:49 PM

Rod,

Thanks for the code...but where do I put it in my project?  Can this be used in the webbrowser control?

Thanks!

Anthony (abev) wrote re: Scraping Secure HTTPS Sites (Updates for .NET 2.0 Framework)
on 09-17-2009 9:38 AM

Thanks Rod! This code bailed me out big time. I needed to post variables to the page so I added the following before oResp was defined: (example)

Dim data As String = "action=updateAll&maxPrice=590&sPrice=0&startingPos=0&nbrRecs=60"
Dim writer As New StreamWriter(oReq.GetRequestStream)
writer.Write(data)
writer.Close()
