Scraping Structured Data from the Web Using UiPath

February 14, 2018 | Robotics

Structured Data

Scraping structured data from a web browser using UiPath standard activities is not always easy when the data is presented in a structured view within the browser but the underlying HTML is not “that structured”.

A good example is the “Contact Details Card” on LinkedIn. At the time this article was written, the data was organised in multiple tables.

Here we present one possible approach to this problem.

The basic method is to get the HTML text (the “innerHtml” attribute of a surrounding HTML element that contains all the data we need) and use a regular expression to extract the structured data.

The element we are after is <div class="more-info-tray">, and its innerHtml is:

   <table summary="Online contact information for . There are 2 levels of row headings.">
      <tbody>
         <tr>
            <th scope="row">Emails</th>
            <td>
               <ul>
                  <li><a href="mailto:timpinchin@gmail.com">timpinchin@gmail.com</a></li>
               </ul>
            </td>
         </tr>
      </tbody>
   </table>
   <table summary="Contact information for . There is 1 level of row heading." class="no-contact-info-data"></table>
   <hr class="separator ">
   <table summary="Internet presence for . There is 1 level of row heading.">
      <tbody>
         <tr>
            <th scope="row" class="linkedin-logo">LinkedIn</th>
            <td>
               <ul>
                  <li><a target="_blank" href="https://uk.linkedin.com/pub/tim-pinchin/15/752/906">https://uk.linkedin.com/pub/tim-pinchin/15/752/906</a></li>
               </ul>
            </td>
         </tr>
         <tr>
            <th scope="row">Twitter</th>
            <td>
               <ul>
                  <li><a target="_blank" href="https://www.twitter.com/timpinchin">timpinchin</a></li>
               </ul>
            </td>
         </tr>
      </tbody>
   </table>

The regular expression we are going to use is:

((.*?<th.*?>(?<area>.*?)<.*?|.*?)<li.*?>.*?href="(?<url>.*?)".*?>(?<descript>.*?)<.*?</li>)

We are testing the regular expression using RegEx Tester v. 3.2.0.0:
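The same check can be sketched in a few lines of C# for readers without that tool. This is only an illustration: the class name, the method name and the sample HTML (placeholder values, written on a single line) are assumptions made for the example, not part of the original activity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactRegexSketch
{
    // The pattern from the article, with straight quotes around the href value.
    public const string Pattern =
        "((.*?<th.*?>(?<area>.*?)<.*?|.*?)<li.*?>.*?href=\"(?<url>.*?)\".*?>(?<descript>.*?)<.*?</li>)";

    // Hypothetical sample input: one line, no line breaks, placeholder values.
    public const string SampleHtml =
        "<table><tbody>" +
        "<tr><th scope=\"row\">Emails</th><td><ul>" +
        "<li><a href=\"mailto:someone@example.com\">someone@example.com</a></li>" +
        "</ul></td></tr>" +
        "<tr><th scope=\"row\">Twitter</th><td><ul>" +
        "<li><a target=\"_blank\" href=\"https://www.twitter.com/someone\">someone</a></li>" +
        "</ul></td></tr>" +
        "</tbody></table>";

    // Returns one "area | url | descript" line per match of the pattern.
    public static List<string> Extract(string html)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            lines.Add(string.Format("{0} | {1} | {2}",
                m.Groups["area"].Value,
                m.Groups["url"].Value,
                m.Groups["descript"].Value));
        }
        return lines;
    }
}
```

Calling ContactRegexSketch.Extract(ContactRegexSketch.SampleHtml) yields one line per list item. Note that with the default options "." does not match a line break, so if the innerHtml contains line breaks between the <th> and the <li>, the "area" group will stay empty unless RegexOptions.Singleline is used.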

Now that we have a method for extracting structured data (in this case from web pages), we need an easy way to integrate it into UiPath. To that end we will build a custom activity that receives two parameters, the input text and the regular expression, and returns a DataTable containing every group captured by applying the regular expression to the input text:

[Category("Input")]
[RequiredArgument]
public InArgument<String> regEx { get; set; }

[Category("Input")]
[RequiredArgument]
public InArgument<String> Text { get; set; }

[Category("Output")]
public OutArgument<DataTable> Table { get; set; }

The code to extract groups as DataTable is:

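// Read the two input arguments (rex and text are used below).
string rex = regEx.Get(context);
string text = Text.Get(context);

Regex r = new Regex(rex);
Match match = r.Match(text);

// No match at all: return a null table.
if (!match.Success)
{
    Table.Set(context, null);
    return;
}

// One column per capture group; unnamed (numeric) groups become Group_0, Group_1, ...
// IsNumeric is a helper that is not shown in this excerpt.
DataTable dt = new DataTable();
foreach (String groupName in r.GetGroupNames())
{
    String columnName = IsNumeric(groupName) ? String.Format("Group_{0}", groupName) : groupName;
    dt.Columns.Add(columnName, typeof(string));
}

// One row per match.
while (match.Success)
{
    DataRow dr = dt.NewRow();
    foreach (String groupName in r.GetGroupNames())
    {
        String columnName = IsNumeric(groupName) ? String.Format("Group_{0}", groupName) : groupName;
        dr[columnName] = match.Groups[groupName].Value;
    }
    dt.Rows.Add(dr);
    match = match.NextMatch();
}

// Hand the populated table to the output argument.
Table.Set(context, dt);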
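The IsNumeric helper called in the loop above is not shown in this excerpt. One plausible stand-in, assuming its only job is to recognise the digit-only names that .NET gives to unnamed groups ("0", "1", ...), might look like the sketch below. The class and method names are hypothetical.

```csharp
using System.Globalization;

public static class GroupNameSketch
{
    // Hypothetical stand-in for the IsNumeric helper, which the excerpt does not show.
    // Returns true only when the name consists solely of digits.
    public static bool IsNumeric(string value)
    {
        int ignored;
        return int.TryParse(value, NumberStyles.None, CultureInfo.InvariantCulture, out ignored);
    }

    // Mirrors the column-naming rule used in the loop above.
    public static string ColumnNameFor(string groupName)
    {
        return IsNumeric(groupName) ? string.Format("Group_{0}", groupName) : groupName;
    }
}
```

NumberStyles.None is used so that signs, whitespace and separators are rejected, which keeps a named group such as "area" from ever being mistaken for a numbered one.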

You can find below the complete code for the activity. Using the activity in UiPath:

Which produces the result:

Tip: naming the groups in the regular expression means the DataTable columns carry the same names, which makes it easy to address a specific extracted group later.
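As a rough illustration of the tip, the sketch below reads values from a table by the group names "area" and "url". The table is built by hand with placeholder values to stand in for the activity's output, and the class and method names are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Data;

public static class NamedColumnSketch
{
    // Builds a small table shaped like the activity's output for the named groups.
    // The row values are placeholders, not real scraped data.
    public static DataTable BuildSample()
    {
        var table = new DataTable();
        table.Columns.Add("area", typeof(string));
        table.Columns.Add("url", typeof(string));
        table.Columns.Add("descript", typeof(string));
        table.Rows.Add("Emails", "mailto:someone@example.com", "someone@example.com");
        table.Rows.Add("Twitter", "https://www.twitter.com/someone", "someone");
        return table;
    }

    // Collects the "url" values of all rows whose "area" equals the given heading.
    public static List<string> UrlsFor(DataTable table, string area)
    {
        var urls = new List<string>();
        foreach (DataRow row in table.Rows)
        {
            if ((string)row["area"] == area)
            {
                urls.Add((string)row["url"]);
            }
        }
        return urls;
    }
}
```

Because the columns are addressed by name, the lookup keeps working if further groups are added to the regular expression and the column order changes.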
