Hello Devz,
Sometimes it is useful to copy part of a website’s content. That’s where web scraping comes in, and HTML Agility Pack is one of the best tools for it. In this tutorial, I will show you a simple HTML Agility Pack example.
Decide what content you need
Say I want a list of all the countries in the world along with their country codes. A quick search turns up a website listing them, which can then be scraped: load the web page from C#, find the elements that hold the data, and extract it.
Web scraping with this HTML Agility Pack example
HTML Agility Pack is a free, open-source .NET library that parses HTML and makes it easy to select the nodes we want from a web page.
The code below uses HTML Agility Pack to get the country names and codes:
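Before the full program, here is a minimal sketch of node selection on its own. It parses a small in-memory HTML string instead of a live page; the markup, the class name NodeSelectionSketch and the sample rows are made up for illustration, and only the HtmlAgilityPack NuGet package is assumed.

```csharp
using System;
using HtmlAgilityPack;

class NodeSelectionSketch
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html =
            "<table>" +
            "<tr class='border1'><td>France</td><td>FR</td></tr>" +
            "<tr class='border1'><td>Japan</td><td>JP</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before looping.
        var rows = doc.DocumentNode.SelectNodes("//tr[@class='border1']");
        if (rows == null) return;

        foreach (var row in rows)
        {
            // "./td" selects the cells that are direct children of this row.
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 2) continue;

            Console.WriteLine($"{cells[0].InnerText};{cells[1].InnerText}");
        }
    }
}
```

Working from a string like this is a handy way to try out an XPath expression before pointing the scraper at a real site.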
using HtmlAgilityPack;
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            WebDataScrap();
        }

        public static void WebDataScrap()
        {
            try
            {
                // Get the content of the URL from the Web
                const string url = "http://www.nationsonline.org/oneworld/country_code_list.htm";
                var web = new HtmlWeb();
                var doc = web.Load(url);

                // Alternatively, get the content from a file
                //var path = "countries.html";
                //var doc = new HtmlDocument();
                //doc.Load(path);

                // Filter the content: remove all <script> nodes.
                // ToList() takes a snapshot so nodes can be removed while iterating.
                doc.DocumentNode.Descendants()
                    .Where(n => n.Name == "script")
                    .ToList()
                    .ForEach(n => n.Remove());

                // Select every element whose class attribute is exactly "border1".
                // SelectNodes returns null when nothing matches, hence the fallback.
                const string classValue = "border1";
                var nodes = doc.DocumentNode.SelectNodes($"//*[@class='{classValue}']") ?? Enumerable.Empty<HtmlNode>();

                // Write the desired content to a file
                using (var file = new StreamWriter("test.txt"))
                {
                    foreach (var node in nodes)
                    {
                        // Split the node text into lines, then drop blank lines and
                        // filler cells (InnerText keeps entities such as "&nbsp;" undecoded)
                        var lines = Regex.Split(node.InnerText, "\n");
                        var words = lines
                            .Where(x => !x.Contains("&nbsp;") && !string.IsNullOrEmpty(x.Trim()))
                            .ToList();

                        // A country row is expected to hold exactly 4 values
                        if (words.Count != 4) continue;

                        // The first value is the country name, the third is the country code
                        var countryName = words[0].Trim();
                        var countryCode = words[2].Trim();
                        var result = $"{countryName};{countryCode}";

                        file.WriteLine(result);
                        Console.WriteLine(result);
                    }
                }

                Console.WriteLine("\r\nPlease press a key...");
                Console.ReadKey();
            }
            catch (Exception ex)
            {
                Console.WriteLine($"An error occurred:\r\n{ex.Message}");
            }
        }
    }
}
Note about CSS classes
Of course, the way to get content out of a web page depends on the page itself. This code can’t be generic: it relies on the CSS class names the page uses (border1 here), so it has to be adapted for each site.
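One detail worth knowing when selecting by class: an XPath test like @class='border1' only matches when the attribute is exactly that string, so an element with class="border1 odd" would be missed. The sketch below shows a common XPath 1.0 idiom for matching a single class token instead. The helper name SelectByClass and the sample markup are hypothetical, not part of HTML Agility Pack or of the page scraped above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ClassSelectorSketch
{
    // Hypothetical helper: returns every element whose class attribute
    // contains className as a whole, whitespace-separated token.
    public static IEnumerable<HtmlNode> SelectByClass(HtmlDocument doc, string className)
    {
        // Padding both sides with spaces prevents "border1" from matching "border10".
        var xpath = $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]";

        // SelectNodes returns null when nothing matches.
        return doc.DocumentNode.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }

    static void Main()
    {
        // Made-up markup: only the first div carries the "border1" token.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='border1 odd'>A</div><div class='border10'>B</div>");

        foreach (var node in SelectByClass(doc, "border1"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

Whether this is needed depends on the target page: if its elements carry a single class, the exact match used in the tutorial code is enough.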
Happy web scraping! 🙂




