HTML tags remover

HTML Tags Remover

Table of Contents

Web scraping has become a common practice for fetching data from websites and storing it for analysis. However, when storing this data in a SQL Server database, HTML tags can be a hindrance when analyzing the extracted data. In this article, we will explore a method for stripping HTML tags from text data using SQL Server.

The Problem

Consider the following SQL table:

sql
CREATE TABLE HTMLTable ( id int, htmlText varchar(max) );

Let’s say we have filled this table with data extracted from web pages:

sql
INSERT INTO HTMLTable VALUES (1,'<p>Hello <b>World</b>!</p>'), (2,'<div>Welcome to my <span>blog</span></div>'), (3,'Goodbye');

Now, if you select the data from this table, you’ll see that the htmlText field contains HTML tags. These tags can be an obstacle when analyzing the text data.

The Solution

To strip HTML tags from the htmlText column in the HTMLTable, we can create a user-defined function that replaces the HTML tags with an empty string.

Here is the function:

“`sql
CREATE FUNCTION dbo.udf_StripHTML (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT

SET @Start = CHARINDEX('<', @HTMLText)

WHILE @Start > 0 AND @Start < LEN(@HTMLText) AND CHARINDEX('>', @HTMLText, @Start) > 0 
BEGIN 
    SET @End = CHARINDEX('>', @HTMLText, @Start) 
    SET @Length = (@End - @Start) + 1

    IF @Length > 0 
        SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')

    SET @Start = CHARINDEX('<', @HTMLText) 
END

RETURN @HTMLText

END;
GO
“`

This function loops through the HTML text and replaces each occurrence of text that starts with < and ends with > with an empty string. This effectively removes all HTML tags from the text.

To apply this function to the htmlText column in the HTMLTable, you can use the following update statement:

sql
UPDATE HTMLTable SET htmlText = dbo.udf_StripHTML(htmlText);

This updates the htmlText column in the HTMLTable, stripping all HTML tags from the text.

You can also check the data by running the following script:

sql
SELECT * FROM HTMLTable

Conclusion

This method provides a simple way to clean text data from HTML tags in SQL Server. However, it’s important to note that this function does not decode HTML entities and does not handle incorrectly formatted HTML tags. Additionally, regular expressions are not recommended for parsing HTML in a production environment, as there are many edge cases that regular expressions cannot handle correctly. For complex HTML parsing, it is advisable to use a dedicated HTML parser.

The function dbo.udf_StripHTML is a commonly used user-defined function in the SQL Server community for stripping HTML tags from a string. It is not built into the SQL Server system and can be found in various SQL Server forums and blogs online.

Remember to use the function wisely and thoroughly test it in a controlled environment before applying it to production systems. Happy coding!


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *