SEO and In-site Searching Module Programming - Part 6

Published: 6/3/2011
By: Xianzhong Zhu

In the previous parts of this series you learned about the back-end and front-end modules of the Q&A sample application, as well as some SEO-related techniques for ASP.NET 4.0. In this part, we shift our attention to how to construct the internal search module, which techniques you need to build it, and which SEO optimizations to apply.

Contents

  • Part 1 This series first introduces search engine concepts and technologies, and then uses the in-site search module of an ASP.NET 4.0 sample Web site (a simple question-and-answer site) to show readers how to put the SEO-related techniques into practice.
  • Part 2 In the last article we addressed the importance of an in-site search engine, the technical difficulties in developing one, and solutions for building a usable one. In this article, we turn to another important topic, SEO (search engine optimization), together with many related details and tips.
  • Part 3 The first two articles of this series mainly dwelled upon SEO-related theory. What is more interesting, though, are the details and tips for building a practical ASP.NET 4.0 based web application. Starting from this article, we focus on practice: developing the kind of Q&A module that a real ASP.NET website frequently contains.
  • Part 4 In this part, I will introduce the back-end submodules of the Q&A Web application.
  • Part 5 In the last part of this series you learned about the back-end submodules of the small Q&A sample application, as well as some SEO-related techniques for ASP.NET 4.0. In this part, we shift our attention to the front-end part.
  • Part 6 In the previous parts of this series you learned about the back-end and front-end modules of the Q&A sample application, as well as some SEO-related techniques for ASP.NET 4.0. In this part, we shift our attention to how to construct the internal search module, which techniques you need to build it, and which SEO optimizations to apply.
  • Introduction

    In the previous parts of this series you learned about the back-end and front-end modules of the Q&A sample application, as well as some SEO-related techniques for ASP.NET 4.0. In this part, we shift our attention to how to construct the internal search module, which techniques you need to build it, and which SEO optimizations to apply. As you will see, we rely on LINQ to Entities to execute the search itself, while C# regular expressions play an important role in optimizing the module and rendering user-friendly search results.

    NOTE

    The sample test environments in this series involve:

    1. Windows 7;

    2. .NET 4.0;

    3. Visual Studio 2010;

    4. SQL Server 2008 Express Edition & SQL Server Management Studio Express.

    Introduction to Regular Expressions

    Regular expressions are used to find and match strings, and they can also be used for substitution. They provide a powerful, flexible, and efficient way to process text. Their comprehensive pattern-matching notation lets you quickly parse large amounts of text to find specific character patterns; extract, edit, replace, or delete substrings; or add the extracted strings to a collection in order to generate a report.

    For many applications that deal with strings (such as HTML processing, log file analysis, and HTTP header analysis), regular expressions are an indispensable tool. With their help, you can implement the following features:

    1. Test for a pattern within a string

    For example, you can test an input string to see whether a phone number pattern or a credit card number pattern occurs within it. This is often called data validation.

    2. Replace the text

    You can use a regular expression to identify specific text in a document and then either remove it completely or replace it with other text.

    3. Extract a substring from a string based on a matching pattern

    You can use regular expressions to find specific text within a document or an input field.
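    The three uses above can be sketched with the static methods of the Regex class. This is only a minimal illustration: the class name RegexBasics, the method names, and the simplified phone number pattern are assumptions made for this example and are not part of the Q&A sample application.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBasics
{
    // 1. Validation: does the input contain something shaped like 555-123-4567?
    //    The pattern is deliberately simplified and is not a production-grade validator.
    public static bool ContainsPhoneNumber(string input)
    {
        return Regex.IsMatch(input, @"\b\d{3}-\d{3}-\d{4}\b");
    }

    // 2. Replacement: collapse every run of whitespace into a single space.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    // 3. Extraction: return the first run of digits, or null when there is none.
    public static string FirstNumber(string input)
    {
        Match m = Regex.Match(input, @"\d+");
        return m.Success ? m.Value : null;
    }
}
```

    Regex.IsMatch, Regex.Replace, and Regex.Match are static shortcuts; when the same pattern is applied many times, creating one Regex instance and reusing it is usually the better choice.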

    In a sense, regular expressions constitute a language of their own, with their own syntax and vocabulary (elements). Integrated into C#, they become even more powerful.

    C# provides the Regex class (defined in the namespace System.Text.RegularExpressions) to represent an immutable regular expression. The Regex class provides many methods for matching, substitution, and validation. Because space is limited and a detailed treatment would take us far from our main topic, please refer to MSDN and to dedicated regular expression tutorial sites for details.

    Create a Site Search Module

    On the basis of the question and answer modules established previously, let's now create an independent search module for the Q&A sample application. On the whole, the search module consists of the following parts: a universal search entrance, a search results page, and an extension method that highlights the keyword in the results.

    Note that the query method described in this article is still based on LINQ to Entities, as in the previous parts. We will first set up the fundamentals of the search module, and then explore the related local optimizations.

    Establishing a Universal Search Entrance

    Generally speaking, the search entrance can be placed on a dedicated page, in a site-wide master page, or even in a local module. In our case, we build the search entrance on the master page of the Q&A module, Site.Master. Figure 1 shows the design-time screenshot.

    Figure 1: The design-time screenshot of the master page Site.Master and search entrance

    Note that for now we are only interested in the search entrance at the upper right corner. The other search entrance in the lower part will be covered in a future article devoted to caching support and related optimizations for the search functionality.

    The main markup associated with the basic search support inside the master page is shown below.

    For simplicity, we have not added a watermark or other friendly prompt to the text box. There are numerous existing solutions for this; you can find one on the Internet or implement your own. Next, let's look at the code-behind for the button Button1.
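    As a rough sketch of what such a click handler does, the helper below builds the target URL from the entered keyword. Several things here are assumptions for illustration only: the class name SearchUrlBuilder, the "/search/" route prefix, and the use of Uri.EscapeDataString as a stand-in for Server.UrlEncode so that the example runs outside an ASP.NET request (the two differ in detail; for instance, Server.UrlEncode encodes a space as "+", while Uri.EscapeDataString produces "%20").

```csharp
using System;

public static class SearchUrlBuilder
{
    // Hypothetical route prefix; the real route definition of the sample site may differ.
    private const string SearchRoutePrefix = "/search/";

    // Builds the URL of the search results page for the given keyword.
    public static string Build(string keyword)
    {
        string trimmed = (keyword ?? string.Empty).Trim();

        // Encode the keyword so that it can travel safely as part of the URL.
        return SearchRoutePrefix + Uri.EscapeDataString(trimmed);
    }
}

// Inside Site.Master.cs, the click handler could then look roughly like this:
//
//   protected void Button1_Click(object sender, EventArgs e)
//   {
//       Response.Redirect(SearchUrlBuilder.Build(txtKeyword.Text));
//   }
```

    Keeping the URL construction in a small helper of this kind makes it easy to test without a running Web site.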

    Here we use Server.UrlEncode to encode the content of the TextBox control txtKeyword. Encoding the keyword before passing it as a URL parameter gives better compatibility and avoids side effects on the server's URL resolution. The search bar itself provides only a basic function: it passes the search parameter to the search results page and leaves the query and output to that page, rather than doing the work within a PostBack request of the master page. This decision has the following advantages:

    (1) Separation of logic

    Keeping complex logic out of general-purpose pages such as the master page keeps the code clean: each page does one dedicated job.

    (2) Easier access

    In the above code, jumping directly to the search results page and providing the "kw" parameter in the URL makes the URL directly and repeatedly accessible. The user can bookmark the URL and return later to the result list for the same search conditions. Users can also share the URL, so that other visitors reach the result list directly without having to re-enter the search terms. None of this is possible if the results are rendered directly through PostBack logic.

    (3) Easier migration and upgrades

    Passing the parameter independently is equivalent to defining an interface, so you can always pass different parameters to the query module to obtain different results. This is also common practice in distributed systems: although users access the same page on the same server, the actual processing may take place on another server that simply accepts a parameter and returns the corresponding result.

    Create a Search Results Page

    Now that we have implemented the search entrance and set up a way to pass parameters, it is time to create a new page to show the search results.

    Design the markup code

    First of all, let's look at the key markup of the page SearchResult.aspx (which uses the master page Site.Master), as follows.

    Here, a Repeater control is used to show the information for all matching questions. The most important part is the HyperLink control embedded in the ItemTemplate template. As designed, we use URL routing to navigate the user to the page Question.aspx, which displays the details of the selected question. We also use the extension method HighlightKeyword to highlight the keyword in the question title.

    Next, let's look at the C# code-behind.

    Code-behind design

    Here is the main code of the Page_Load event handler for the file SearchResult.aspx.cs.

    The general logic above is not difficult to understand. First, we use Page.RouteData.Values to obtain the passed keyword. Then, we construct a fairly complex LINQ to Entities statement to fetch the questions that meet the specified conditions. As pointed out in the previous articles, this kind of query may be inefficient; interested readers can consider using stored procedures in the database to improve it. Finally, we bind the data to the Repeater control. That's all.
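    The filtering step of that logic can be sketched as follows. To keep the example runnable without a database, it uses LINQ to Objects over an in-memory list rather than LINQ to Entities, and the Question class with its Id, Title, and Content properties is a hypothetical stand-in for the real entity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Question entity; property names are assumptions.
public class Question
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
}

public static class QuestionSearch
{
    // Returns the questions whose title or content contains the keyword (case-insensitive).
    public static List<Question> Search(IEnumerable<Question> questions, string keyword)
    {
        if (string.IsNullOrWhiteSpace(keyword))
            return new List<Question>();

        string kw = keyword.Trim();
        return questions
            .Where(q => Contains(q.Title, kw) || Contains(q.Content, kw))
            .OrderBy(q => q.Id)
            .ToList();
    }

    // Null-safe, case-insensitive substring test.
    private static bool Contains(string text, string kw)
    {
        return text != null && text.IndexOf(kw, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

    Against an Entity Framework context the predicate would typically be written with string.Contains, which the provider can translate into SQL, because a helper method such as the one above cannot be translated.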

    Write the extension method HighlightKeyword

    A user-friendly search results page should highlight the keywords the user searched for. As a simple approach, you can do it the following way:
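    A plain string replacement along the following lines is one such simple approach. This is only a sketch: the class name NaiveHighlighter, the method name, and the strong tag used for emphasis are assumptions for illustration.

```csharp
using System;

public static class NaiveHighlighter
{
    // Wraps every exact occurrence of the keyword in <strong> tags.
    public static string Highlight(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword))
            return text;

        // string.Replace performs an ordinal, case-sensitive comparison.
        return text.Replace(keyword, "<strong>" + keyword + "</strong>");
    }
}
```

    Because string.Replace is case-sensitive, calling Highlight("Long long ago", "long") marks only the second word and leaves "Long" untouched.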

    However, there are two main disadvantages in this solution:

    To overcome the above shortcomings, it is better to use the regular expressions introduced earlier. This is done in an extension method called HighlightKeyword in the public static class Common.

    Extension methods are a feature introduced in C# 3.0. The method HighlightKeyword above is one of them: it extends the string type. For more details about extension methods, please refer to MSDN; we omit a detailed introduction here.

    Two statements in the above method deserve a closer look:

    (1) string replaceFormat = "<strong><font color=\"red\">{0}</font></strong>";

    This statement defines the format of the highlighted string; in this case, the keyword is rendered in bold red. {0} is a placeholder for the keyword and is filled in when the final replacement string is generated.

    (2) string expr = @"(?<!{0})(?<kw>{0})";

    This statement defines the regular expression string. The @ prefix makes it a verbatim string, so C#-level "\" escapes are not needed, but the escape symbols of the regular expression itself still have to be written. {0} is again a placeholder for the keyword. The negative lookbehind (?<!{0}) means that an occurrence of the keyword immediately preceded by another occurrence is not matched. In addition, kw is the name of the capturing group (an important element of regular expressions).
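    Putting the two statements together, an extension method of this kind might look like the sketch below. Only the two format strings come from the text above; the null checks, the call to Regex.Escape, and the RegexOptions.IgnoreCase option are assumptions added for this example and may differ from the method in the sample application.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Common
{
    // Wraps each match of the keyword in the input with bold red markup.
    public static string HighlightKeyword(this string input, string keyword)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(keyword))
            return input;

        string replaceFormat = "<strong><font color=\"red\">{0}</font></strong>";
        string expr = @"(?<!{0})(?<kw>{0})";

        // Assumption: escape the keyword so that characters such as '?' or '+'
        // are matched literally instead of being read as regex operators.
        string pattern = string.Format(expr, Regex.Escape(keyword));

        // Assumption: ignore case, and keep the casing of the matched text in the output.
        return Regex.Replace(
            input,
            pattern,
            m => string.Format(replaceFormat, m.Groups["kw"].Value),
            RegexOptions.IgnoreCase);
    }
}
```

    With this sketch, "Long road".HighlightKeyword("long") wraps "Long" while preserving its capital letter, and in "longlong" only the first "long" is wrapped because the second one is rejected by the lookbehind.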

    At this point, a basic internal search module is finished.

    Optimize the In-site Searching Module

    So far, we have set up a fundamental internal search module, and the implementation has shown the basic ideas behind building one. However, some of the choices made so far are not robust. The weaknesses mainly concern the following aspects: searching efficiency, matching accuracy, user experience, and keyword filtering.

    As for searching efficiency, we will deal with it in the caching topic later in this series. Now, let's look at the remaining points.

    Matching accuracy

    In the previous section, we managed to highlight the keywords on the search results page. It looks pretty good, doesn't it? However, such simple regular expressions have big loopholes, one of which is matching accuracy. Let's consider a concrete example.

    Suppose the user searches for the keyword 'long'. In matched questions such as "long long ago", every occurrence of the keyword is highlighted, including occurrences right next to each other. This is not what the user wants.

    From the application's point of view, it is not wrong for the regular expression to highlight every occurrence of "long". To the user, however, it looks as if the keywords entered were "long long". To handle this case, the regular expression has to be improved so that the highlighted matches are not contiguous.

    To solve this problem, you need to turn to the atomic zero-width assertions of regular expressions. Table 1 shows the commonly used ones.

    Table 1: Commonly-used atomic zero-width assertion symbols

    Assertion | Explanation | Example
    ^ | Matches the position at the start of the searched string. If the m (multiline search) character is included with the flags, ^ also matches the position following \n or \r. When used as the first character in a bracket expression, ^ negates the character set. | ^\d{3} matches 3 numeric digits at the start of the searched string. [^abc] matches any character except a, b, and c.
    $ | Matches the position at the end of the searched string. If the m (multiline search) character is included with the flags, $ also matches the position before \n or \r. | \d{3}$ matches 3 numeric digits at the end of the searched string.
    \A | Match must appear at the beginning of the string. |
    \Z | Match must appear at the end of the string or before the newline \n at the end of the string. |
    \z | Match must appear at the end of the string. |
    \G | Match must appear at the place where the last match ended. |
    \b | Matches a word boundary; that is, the position between a word character and a non-word character such as a space. | er\b matches the "er" in "never" but not the "er" in "verb".
    \B | Matches a position that is not a word boundary. | er\B matches the "er" in "verb" but not the "er" in "never".

    Let's now return to our topic. We should modify the extension method HighlightKeyword in the file Common.cs as follows (note the bold part):

    Now, with the initial regular expression (?<kw>{0}) prefixed with (?<!{0}), an occurrence of the keyword that immediately follows another occurrence is no longer matched.
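    The effect of the prefix can be checked with a small counting helper. The sketch below is for illustration only; the class name AssertionDemo and the sample strings are assumptions, and the \b variant is shown merely for comparison with Table 1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssertionDemo
{
    // Counts how many times the pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }
}

// Example patterns for the keyword "long":
//   "long"                  plain pattern, matches every occurrence
//   "(?<!long)(?<kw>long)"  skips an occurrence directly preceded by "long"
//   @"\blong\b"             matches "long" only as a whole word
```

    Note that the lookbehind inspects only the characters directly before a candidate match. In "longlong" the second occurrence is skipped, whereas in "long long ago" both occurrences still match, because a space separates them.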

    A few words about the character @ used in the pattern string. First, @ indicates that the string following it is a verbatim string. Note that @ applies only to string literals (for instance, a file path written directly in code). Second, a string starting with @ can span several lines, which makes it more convenient to embed JavaScript or SQL scripts in a .cs file. Refer to the following:
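    A short sketch of verbatim string literals follows. The class name VerbatimStrings, the file path, and the table and column names in the SQL text are made up for this example.

```csharp
public static class VerbatimStrings
{
    // The same path written as a regular literal and as a verbatim literal.
    public const string RegularPath = "C:\\inetpub\\wwwroot\\qa";
    public const string VerbatimPath = @"C:\inetpub\wwwroot\qa";

    // A verbatim literal may span several lines; the line break becomes part of the string.
    public const string Sql = @"SELECT Id, Title
FROM Questions";

    // Inside a verbatim literal, a double quote is written as two double quotes.
    public const string Quoted = @"say ""hi""";

    // Regular expression escapes such as \d still have to be written;
    // @ only removes the need to double the backslash for C#.
    public const string DigitsPattern = @"\d+";
}
```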

    Third, according to the C# specification, the character @ can be used as the first character of an identifier (class name, variable name, method name, etc.) so that a reserved C# keyword can serve as a custom identifier. Refer to the following:
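    A minimal sketch of such identifiers is shown here; the wrapper class KeywordIdentifierDemo is an assumption added only so that the call can be demonstrated.

```csharp
// A class named "class" with a static method named "static"
// that takes a parameter named "bool".
public class @class
{
    public static string @static(bool @bool)
    {
        return @bool ? "true" : "false";
    }
}

public static class KeywordIdentifierDemo
{
    // Calls the method through its @-prefixed names.
    public static string Run()
    {
        return @class.@static(true);
    }
}
```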

    Note that although @ appears in front of the identifier, it is not part of the identifier itself. Therefore, in the above example, we define a class named class, which contains a static method named static with a parameter called bool.

    In addition, when @ is placed before a string enclosed in double quotes, some escape symbols can be omitted, as in the extension method HighlightKeyword.

    User experience

    As far as the user experience of the present search module is concerned, there are still plenty of aspects to improve, such as keeping the keyword in the search box after a search and highlighting the keyword on the details page.

    Of course, besides these two points, many other areas could be improved. But since these two are representative and match the habits of most users, we focus on them and give corresponding solutions.

    1. Keeping the keyword

    Because the search function uses Response.Redirect() to jump to another page and pass the related parameters, ViewState cannot preserve the keyword entered on the master page: the redirect issues a new GET request, and the keyword text box in the master page is rendered empty. So the keyword has to be passed back to the master page and written into the text box, so that the user still sees it there while viewing the results page. How can we achieve this effect?

    A common approach is to call the FindControl() method on the master page from the search results page, locate the server control for the keyword input box, and set its Text property:

    However, this method has a significant drawback: it tightly couples the content page to the ID of a control in the master page. This not only costs some efficiency but also means the ID of the keyword text box in the master page cannot be changed easily, which hurts maintainability. To solve this problem, you can set the access level of the control txtKeyword in the master page to public:

    Then, in the content page, cast the Master property to the type of the current master page, get the target server control from it, and set the control's property:

    On the surface, this approach seems to overcome the drawbacks of FindControl(). But it introduces a new problem: if SearchResult.aspx is later assigned a different master page, you must change the cast everywhere the old master page type is referenced, which strengthens the coupling between the code-behind (.cs) files and the front end. In addition, changing the txtKeyword server control from protected to public is itself an unsafe practice, because it makes txtKeyword accessible outside the master page.

    So how can we meet the requirement and still decouple the two? The answer is to introduce the <%@ MasterType %> directive and define a property on the master page.

    As the name implies, <%@ MasterType %> declares the type of the master page. It is set in an .aspx page, in the same place as <%@ Page %>, generally at the top of the page. MasterType has a VirtualPath attribute that identifies the master page whose type is used for the Master property. In practice, you only need to specify the master page's path, using the same value as the MasterPageFile attribute of <%@ Page %>.

    Here, we use <%@ MasterType %> in the page SearchResult.aspx to define the master page type:

    Thus, the type of the Master property in SearchResult.aspx is the type of Site.Master, and its public members can be accessed directly. However, "public" here does not mean making the control txtKeyword public; instead, we create a new public accessor for it. In the Site.Master file, add the following code:

    In this way, the page SearchResult.aspx can access the Keyword property of Site.Master and set its value. The related code is given below.
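    The two pieces described above might look roughly like the following fragment. It is a sketch that only compiles inside an ASP.NET Web Forms project, where the designer files declare txtKeyword and the directive generates the typed Master property. The class names Site and SearchResult, the virtual path "~/Site.Master", and the route key "kw" are assumptions for this example.

```csharp
// Site.Master.cs (code-behind of the master page)
public partial class Site : System.Web.UI.MasterPage
{
    // Public accessor around the protected txtKeyword TextBox declared in Site.Master.
    // The control itself stays protected.
    public string Keyword
    {
        get { return txtKeyword.Text; }
        set { txtKeyword.Text = value; }
    }
}

// SearchResult.aspx.cs
// Assumes SearchResult.aspx contains: <%@ MasterType VirtualPath="~/Site.Master" %>
public partial class SearchResult : System.Web.UI.Page
{
    protected void Page_Load(object sender, System.EventArgs e)
    {
        if (!IsPostBack)
        {
            // "kw" is the assumed name of the route parameter carrying the keyword.
            string keyword = Page.RouteData.Values["kw"] as string;

            // Master is strongly typed thanks to the MasterType directive, so no cast is needed.
            Master.Keyword = keyword ?? string.Empty;
        }
    }
}
```

    If the page is later given a different master page, only the directive and that master page's Keyword property need to exist; the code-behind of SearchResult.aspx stays unchanged.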

    Next, let's discuss another user experience issue: keyword highlighting.

    2. Keyword highlighting

    When the user moves from the search results page to the details page, highlighting the keyword there as well helps the user spot it quickly and find the desired information.

    Highlighting on the details page works the same way as on the results page: decode the keyword parameter from the URL to obtain the original keyword, and then invoke the extension method HighlightKeyword to mark the matching content. The detailed steps are shown below.

    (1) In the file Question.aspx.cs, add the property KeyWord:

    (2) In the Page_Load event handler, add the following statement to set KeyWord:

    (3) Before using the preceding Master.Keyword = KeyWord, as with the page SearchResult.aspx, specify the master type for the page Question.aspx:

    (4) Modify the method LoadInfo() called in the Page_Load event handler of the page Question.aspx, so that keyword highlighting is applied to the specified areas:

    (5) Besides the question content and the best answer, if you also want to highlight the keyword in the other answers, modify the code of the Repeater control in the page Question.aspx:

    Keyword filtering and URL routing

    Keyword filtering is extremely important in search queries, because special characters in a keyword may end up in a database query or in regular expression matching. This can affect the accuracy of queries and matches, and can seriously affect application security and stability.

    Take our internal search engine for the Q&A module as an example. If you enter special characters such as "?", "/", or "&", the system returns an HTTP 404 error. In other words, URL routing does not accept such symbols in a parameter passed as part of the URL.

    Figure 2: HTTP 404 Error is thrown out when passing special characters as url parameter

    Now, let's look at our coding for the above case. In the file Site.Master.cs:

    According to MSDN, 'URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as a question mark (?), ampersand (&), slash mark (/), and spaces might be truncated or corrupted by some browsers. As a result, these characters must be encoded in <a> tags or in query strings where the strings can be re-sent by a browser in a request string.' That is, the first statement is valid and allowed in general code. But as soon as the second statement is executed, an error like the one in Figure 2 occurs. So we can say that ASP.NET 4.0 URL routing does not allow the above special characters in parameters passed as part of a URL.

    In addition, a raw HTTP 404 error page like the one in Figure 2 is unfriendly. In real scenarios, you can create your own custom error page to handle such cases.
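    One possible way to avoid the error in the first place is to filter the keyword before it is placed in the route URL. The sketch below is only one hypothetical policy, not the approach of the sample application: the class name KeywordFilter and the rule of replacing everything except word characters and whitespace are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class KeywordFilter
{
    // Replaces every character that is not a letter, digit, underscore, or whitespace
    // with a space, then collapses runs of whitespace and trims the result.
    public static string Clean(string keyword)
    {
        if (string.IsNullOrEmpty(keyword))
            return string.Empty;

        string withoutSymbols = Regex.Replace(keyword, @"[^\w\s]", " ");
        return Regex.Replace(withoutSymbols, @"\s+", " ").Trim();
    }
}
```

    This policy is deliberately strict and also strips meaningful symbols, so a search for "c#" becomes "c". A real project might instead whitelist a few extra characters or pass the keyword in a query string.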

    Finally, in a real project, SQL injection is another serious issue to consider. Since it is beyond the scope of this article, we will not go into detail here.

    Summary

    We have now built a basic, still elementary in-site search engine for the Q&A sample application. As you have seen, we implemented only the basic functionality and a limited set of optimizations for the module; a real project leaves much more to explore. In the next and last article, we will look at a possible caching policy for the internal search engine of the Q&A sample application.

  • Please visit the URL below for any additional user comments.

    Original Url: http://dotnetslackers.com/articles/aspnet/SEO-and-In-site-Searching-Module-Programming-6.aspx