At the moment I'm migrating my website. The old version is a bunch of static htm pages scribbled with FrontPage; the new version is an asp.net 2.0 site. As the old site is well read, I don't want to break any links. A link to
http://www.gekko-software.nl/DotNet/Art01.htm should keep serving the same "Delphi vs. C#" article. The easiest way would be to just copy the files and use IIS as a dumb "htm-file-server". But I want to have control over the pages: display them in a nice .net (master) page and add functionality as desired. An option could be to use frames, one frame for the aspx and one frame for the htm. But for many reasons I don't find frames very nice to work with. So here I'll present a pure .net solution.
The first step is to have asp.net handle every htm request to the site (like http://www.gekko-software.nl/DotNet/Art01.htm). You do this in the configuration of the virtual directory in IIS: add the htm extension to the application configuration list and set the executable to aspnet_isapi.dll.
Now every incoming htm request for the virtual directory will be handled by asp.net.
To intercept the request and redirect it to my viewer I install a so-called HttpModule. An HttpModule is a way to be the first or the last in the handling of any request coming to the site. Installing an HttpModule is done in the web.config:
<system.web>
  <httpModules>
    <add name="UrlRewriter" type="Gekko.WebSite.URLrewriter"/>
  </httpModules>
</system.web>
The module has a name and a type. The type is a class which implements the IHttpModule interface. It lives in the App_Code folder of the site.
using System;
using System.Web;

namespace Gekko.WebSite
{
    public class URLrewriter : IHttpModule
    {
        #region IHttpModule Members
        public void Dispose()
        {
        }

        public void Init(HttpApplication context)
        {
            // BeginRequest is the first event raised for every request
            context.BeginRequest += new EventHandler(context_BeginRequest);
        }
        #endregion

        void context_BeginRequest(object sender, EventArgs e)
        {
            HttpApplication httpApp = (HttpApplication)sender;
            string pageName = httpApp.Request.AppRelativeCurrentExecutionFilePath;
            // Compare the extension case-insensitively, so Art01.HTM is caught as well
            if (pageName.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            {
                // Substring(2) strips the leading "~/" from the app-relative path
                httpApp.Context.RewritePath(string.Format(@"~/ArticleViewer.aspx?article={0}", pageName.Substring(2)), false);
            }
        }
    }
}
IHttpModule is a nice lean interface. The Init method is passed the full context of your web application. I add a handler to the BeginRequest event, which gives my code a first look at every request coming in and even the possibility to change the request. The handler filters out any request for an htm file and, using the Context.RewritePath method, rewrites the request url to that of my aspx page with the viewer. It passes the desired htm file in the querystring. Now all requests for an htm file will be served by my asp.net 2.0 code.
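To make the mapping concrete, here is a minimal sketch of the rewrite rule as a plain function that can be run outside of asp.net. The class and method names (HtmRewriteDemo, Rewrite) are made up for this illustration, and the case-insensitive extension check is an assumption of the sketch.

```csharp
using System;

public static class HtmRewriteDemo
{
    // Hypothetical helper: returns the rewritten path for an htm request,
    // or null when the request should be left alone.
    public static string Rewrite(string appRelativePath)
    {
        // Assumption of this sketch: the extension is matched regardless of case.
        if (!appRelativePath.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            return null;

        // Substring(2) drops the leading "~/" of the app-relative path.
        return string.Format("~/ArticleViewer.aspx?article={0}", appRelativePath.Substring(2));
    }

    public static void Main()
    {
        // prints "~/ArticleViewer.aspx?article=DotNet/Art01.htm"
        Console.WriteLine(Rewrite("~/DotNet/Art01.htm"));
        // prints "(not rewritten)"
        Console.WriteLine(Rewrite("~/Default.aspx") ?? "(not rewritten)");
    }
}
```

The browser keeps showing the original htm url; only the server-side path changes.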
(You can do a lot more with HttpModules. There are many events to hook into. A module can be the first and the last one to handle, bend, modify or analyze all requests served by your app. There are loads and loads of good samples to be found all over the web.)
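As an illustration of hooking more than one event, the sketch below times every request by pairing BeginRequest with EndRequest. It is not part of my site: the RequestTimer class and the "RequestTimer.Start" key are hypothetical names, and the module would need its own add entry in httpModules. The result only shows up when asp.net tracing is enabled.

```csharp
using System;
using System.Web;

namespace Gekko.WebSite
{
    // Hypothetical module, shown only to illustrate a second pipeline event.
    public class RequestTimer : IHttpModule
    {
        private const string StartKey = "RequestTimer.Start";

        public void Init(HttpApplication context)
        {
            context.BeginRequest += new EventHandler(OnBeginRequest);
            context.EndRequest += new EventHandler(OnEndRequest);
        }

        public void Dispose()
        {
        }

        private void OnBeginRequest(object sender, EventArgs e)
        {
            HttpApplication app = (HttpApplication)sender;
            // Context.Items lives exactly as long as the current request
            app.Context.Items[StartKey] = DateTime.UtcNow;
        }

        private void OnEndRequest(object sender, EventArgs e)
        {
            HttpApplication app = (HttpApplication)sender;
            object start = app.Context.Items[StartKey];
            if (start != null)
            {
                TimeSpan elapsed = DateTime.UtcNow - (DateTime)start;
                // Written to the asp.net trace output
                app.Context.Trace.Write("RequestTimer",
                    string.Format("{0} took {1:F0} ms", app.Request.RawUrl, elapsed.TotalMilliseconds));
            }
        }
    }
}
```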
Now the viewer has to display the htm. How will it do that? The easy part is that you can assign any html to the Text property of a label. The rendered page will then display the htm in its full glory. But I want to be a neat citizen on the web and not render any garbage. The original htm of my pages has a lot of bla-bla FrontPage headers. What my code will do is extract the real content from the htm file and assign that to the label.
An html response looks (or should look) like this:
<html>
<head>
<title>This page is about software</title>
.......
</head>
<body>
.................
</body>
</html>
The content is between the body tags.
The code takes this approach:
- Read the htm filename from the querystring
- Read the htm file into the rawHtml string
- Extract the page title using a regular expression
- Assign the title to the viewerpage's title
- Extract the page body by searching for the body tags
- Assign the body to the text of a label
// Needs System, System.IO and System.Text.RegularExpressions in the using list
private void displayArticle()
{
    string pageName = Request.Params["article"];
    if (pageName != null)
    {
        // read in the htm file
        string fullFileName = HttpContext.Current.Server.MapPath(pageName);
        StreamReader sr = null;
        try
        {
            sr = new StreamReader(fullFileName);
            string rawHtml = sr.ReadToEnd();
            // Use regex to extract title. The options are combined with | (not &);
            // Singleline lets the dot match line breaks inside the title
            Regex reTitle = new Regex(@"<title\b[^>]*>(.*?)</title", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Match titleMatch = reTitle.Match(rawHtml);
            if (titleMatch.Success)
                this.Title = titleMatch.Groups[1].Value;
            // Plain, case-insensitive search to extract body
            int bodyStart = rawHtml.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
            if (bodyStart >= 0)
            {
                // Find end of body tag
                bodyStart = rawHtml.IndexOf(">", bodyStart);
                int bodyEnd = rawHtml.IndexOf("</body", bodyStart, StringComparison.OrdinalIgnoreCase);
                if (bodyEnd < 0)
                    bodyEnd = rawHtml.Length;
                LabelArticle.Text = rawHtml.Substring(bodyStart + 1, bodyEnd - bodyStart - 1);
            }
        }
        catch (Exception)
        {
            // Missing or unreadable file, or malformed html
            LabelArticle.Text = "Article not available";
        }
        finally
        {
            if (sr != null)
                sr.Close();
        }
    }
}
For the code to build you need to include System.IO and System.Text.RegularExpressions in the using list. A regular expression is a nice way to get the title, even when the tags are spelled poorly, like <tiTle >. The Groups[1].Value member returns the title enclosed by the tags. It would be tempting to use a regular expression to get the body as well. But due to the many nested <'s and >'s inside the body that would be a pretty complicated one. And when you manage to figure out a working one, there's quite a chance it will take ages to evaluate. Here I know there is at most one pair of body tags, so a linear search will be fast and good enough.
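A small console sketch of the title expression, to show what the options do. RegexOptions is a flags enum, so options have to be combined with |; combining IgnoreCase and Multiline with & leaves no option set at all. Singleline is what lets the dot match a line break inside the title. The class and method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleRegexDemo
{
    // Returns the text between the title tags, or null when there is no title.
    public static string ExtractTitle(string html)
    {
        Regex reTitle = new Regex(@"<title\b[^>]*>(.*?)</title",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = reTitle.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // prints "None": the two flags have no bit in common
        Console.WriteLine(RegexOptions.IgnoreCase & RegexOptions.Multiline);

        // Sloppy tag spelling and a line break inside the title are both handled
        string html = "<html><head><tiTle >This page is\nabout software</TITLE></head></html>";
        Console.WriteLine(ExtractTitle(html));
    }
}
```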
<Update>
In a comment James Curran writes down one regular expression which yields both results in one go. It works like a charm and makes the code even simpler.
private void displayArticle()
{
    string pageName = Request.Params["article"];
    if (pageName != null)
    {
        // read in the htm file
        string fullFileName = HttpContext.Current.Server.MapPath(pageName);
        StreamReader sr = null;
        try
        {
            sr = new StreamReader(fullFileName);
            string rawHtml = sr.ReadToEnd();
            // Use regex to extract title and body.
            // <body\b[^>]*> also accepts a body tag that carries attributes
            Regex reHtml = new Regex(@"<title\b[^>]*>(?<Title>.*?)</title\b[^>]*>.*?<body\b[^>]*>(?<Body>.*)</body>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Match m = reHtml.Match(rawHtml);
            if (m.Success)
            {
                this.Title = m.Groups["Title"].Value;
                LabelArticle.Text = m.Groups["Body"].Value;
            }
            else
            {
                LabelArticle.Text = "Article not available";
            }
        }
        catch (Exception)
        {
            LabelArticle.Text = "Article not available";
        }
        finally
        {
            if (sr != null)
                sr.Close();
        }
    }
}
This was too good not to be included in the full story.
</Update>
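One thing to keep in mind: the viewer maps whatever arrives in the article parameter to a file on disk, and ArticleViewer.aspx can also be requested directly with a hand-crafted querystring. A possible guard, sketched here under the assumption that only htm files below the site root should ever be served, is to resolve the path first and reject anything else. ArticlePathGuard and Resolve are hypothetical names; in the viewer the site root would come from Server.MapPath("~/").

```csharp
using System;
using System.IO;

public static class ArticlePathGuard
{
    // Returns the full path of the requested article, or null when the
    // request is not an htm file located below siteRoot.
    public static string Resolve(string siteRoot, string article)
    {
        if (string.IsNullOrEmpty(article) ||
            !article.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            return null;

        // Normalize the root and make sure it ends with a separator,
        // so that "site" does not also match "site2".
        string root = Path.GetFullPath(siteRoot);
        if (!root.EndsWith(Path.DirectorySeparatorChar.ToString()))
            root += Path.DirectorySeparatorChar;

        // GetFullPath collapses ".." segments before the comparison.
        string full = Path.GetFullPath(Path.Combine(root, article));
        return full.StartsWith(root, StringComparison.OrdinalIgnoreCase) ? full : null;
    }

    public static void Main()
    {
        string root = Path.Combine(Path.GetTempPath(), "site");
        Console.WriteLine(Resolve(root, "DotNet/Art01.htm") != null); // prints "True"
        Console.WriteLine(Resolve(root, "../secret.htm") != null);    // prints "False"
        Console.WriteLine(Resolve(root, "web.config") != null);       // prints "False"
    }
}
```

When Resolve returns null the viewer can show the same "Article not available" text it uses for a missing file.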
The result is that all my classic pages are a full part of the asp.net 2.0 site and are still accessible by the classic url. The reader won't even notice.

Posted 04-12-2006 6:32 AM by pvanooijen