Scraping Structured Data from the Web Using UiPath

February 14, 2018 | Robotics

Structured Data

Scraping structured data from a web page using UiPath standard activities is not always easy when the data is presented in a structured view in the browser but the underlying HTML is not “that structured”.

A good example is the “Contact Details Card” on LinkedIn. At the time of writing, the data in this card was actually organised in multiple tables.

Here we present one possible approach to this problem.

The basic method is to get the HTML text (the “innerHtml” attribute of a surrounding HTML element that contains all the data we need) and use a regular expression to extract the structured data from it.

The element we are after is <div class="more-info-tray">, and its innerHtml is:

   <table summary="Online contact information for . There are 2 levels of row headings.">
      <tbody>
         <tr>
            <th scope="row">Emails</th>
            <td>
               <ul>
                  <li><a href="mailto:timpinchin@gmail.com">timpinchin@gmail.com</a></li>
               </ul>
            </td>
         </tr>
      </tbody>
   </table>
   <table summary="Contact information for . There is 1 level of row heading." class="no-contact-info-data"></table>
   <hr class="separator ">
   <table summary="Internet presence for . There is 1 level of row heading.">
      <tbody>
         <tr>
            <th scope="row" class="linkedin-logo">LinkedIn</th>
            <td>
               <ul>
                  <li><a target="_blank" href="https://uk.linkedin.com/pub/tim-pinchin/15/752/906">https://uk.linkedin.com/pub/tim-pinchin/15/752/906</a></li>
               </ul>
            </td>
         </tr>
         <tr>
            <th scope="row">Twitter</th>
            <td>
               <ul>
                  <li><a target="_blank" href="https://www.twitter.com/timpinchin">timpinchin</a></li>
               </ul>
            </td>
         </tr>
      </tbody>
   </table>

The regular expression we are going to use is:

((.*?<th.*?>(?<area>.*?)<.*?|.*?)<li.*?>.*?href="(?<url>.*?)".*?>(?<descript>.*?)<.*?</li>)

We tested the regular expression using RegEx Tester v. 3.2.0.0.
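The same check can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the class name ContactRegexDemo, the SampleHtml fragment and its example.com addresses are made up for the demo, and RegexOptions.Singleline is our addition. Without that option "." does not match a line break, so if the innerHtml arrives pretty-printed over several lines, the area group would likely come back empty.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContactRegexDemo
{
    // The expression from the article, written with straight quotes.
    public const string Pattern =
        "((.*?<th.*?>(?<area>.*?)<.*?|.*?)<li.*?>.*?href=\"(?<url>.*?)\".*?>(?<descript>.*?)<.*?</li>)";

    // Hypothetical fragment modelled on the contact card markup (not real data).
    public const string SampleHtml =
        "<table><tbody>" +
        "<tr><th scope=\"row\">Emails</th><td><ul>" +
        "<li><a href=\"mailto:jane@example.com\">jane@example.com</a></li>" +
        "</ul></td></tr>" +
        "<tr><th scope=\"row\">Twitter</th><td><ul>" +
        "<li><a target=\"_blank\" href=\"https://www.twitter.com/jane\">jane</a></li>" +
        "</ul></td></tr>" +
        "</tbody></table>";

    // Assumption: Singleline is added so that '.' can also cross line breaks.
    public static MatchCollection Extract(string html)
    {
        return Regex.Matches(html, Pattern, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Print the three named groups of every match.
        foreach (Match m in Extract(SampleHtml))
        {
            Console.WriteLine("{0} | {1} | {2}",
                m.Groups["area"].Value,
                m.Groups["url"].Value,
                m.Groups["descript"].Value);
        }
    }
}
```

Each match corresponds to one <li> entry; the area group carries the row heading of the <th> that precedes it.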

Now that we have a method for extracting structured data (in this case from web pages), we need an easy way to integrate it into UiPath. To that end we are going to build a custom activity that receives two arguments, the input text and the regular expression, and returns a DataTable containing every group extracted by applying the regular expression to the input text:

[Category("Input")]
[RequiredArgument]
public InArgument<String> regEx { get; set; }

[Category("Input")]
[RequiredArgument]
public InArgument<String> Text { get; set; }

[Category("Output")]
public OutArgument<DataTable> Table { get; set; }

The code to extract groups as DataTable is:

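// Assumption: rex and text are read from the activity arguments inside Execute.
String rex = regEx.Get(context);
String text = Text.Get(context);

Regex r = new Regex(rex);
Match match = r.Match(text);

// No match at all: return a null table.
if (!match.Success)
{
    Table.Set(context, null);
    return;
}

// One string column per group; numbered groups get a "Group_" prefix.
// IsNumeric is a helper defined elsewhere in the activity (not shown here).
DataTable dt = new DataTable();
foreach (String groupName in r.GetGroupNames())
{
    String columnName = IsNumeric(groupName) ? String.Format("Group_{0}", groupName) : groupName;
    dt.Columns.Add(columnName, typeof(string));
}

// One row per match, one cell per group.
while (match.Success)
{
    DataRow dr = dt.NewRow();
    foreach (String groupName in r.GetGroupNames())
    {
        String columnName = IsNumeric(groupName) ? String.Format("Group_{0}", groupName) : groupName;
        dr[columnName] = match.Groups[groupName].Value;
    }
    dt.Rows.Add(dr);
    match = match.NextMatch();
}

// Assumption: the finished table is handed back through the output argument.
Table.Set(context, dt);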
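To try the extraction logic outside UiPath, the same loop can be wrapped in a plain static method. The following is a minimal sketch, not the activity itself: the RegexTableBuilder class and ToDataTable method are hypothetical names, and IsNumeric is our guess at the helper used in the snippet (it simply checks whether the group name parses as an integer).

```csharp
using System;
using System.Data;
using System.Text.RegularExpressions;

public static class RegexTableBuilder
{
    // Assumed helper: a group name counts as numeric when it parses as an integer.
    private static bool IsNumeric(string groupName)
    {
        int ignored;
        return int.TryParse(groupName, out ignored);
    }

    // Returns one row per match and one column per group,
    // or null when the pattern does not match at all.
    public static DataTable ToDataTable(string text, string pattern)
    {
        Regex regex = new Regex(pattern);
        Match match = regex.Match(text);
        if (!match.Success)
        {
            return null;
        }

        string[] groupNames = regex.GetGroupNames();
        DataTable table = new DataTable();
        foreach (string groupName in groupNames)
        {
            // Numbered groups become Group_0, Group_1, ...; named groups keep their name.
            string columnName = IsNumeric(groupName) ? "Group_" + groupName : groupName;
            table.Columns.Add(columnName, typeof(string));
        }

        for (; match.Success; match = match.NextMatch())
        {
            DataRow row = table.NewRow();
            // Columns were added in groupNames order, so the index lines up.
            for (int i = 0; i < groupNames.Length; i++)
            {
                row[i] = match.Groups[groupNames[i]].Value;
            }
            table.Rows.Add(row);
        }
        return table;
    }

    public static void Main()
    {
        // Toy input: two key=value pairs, extracted with two named groups.
        DataTable table = ToDataTable("a=1;b=2", "(?<key>\\w+)=(?<value>\\d+)");
        foreach (DataRow row in table.Rows)
        {
            Console.WriteLine("{0} -> {1}", row["key"], row["value"]);
        }
    }
}
```

Group 0 is always the whole match, so the resulting table starts with a Group_0 column followed by the named groups.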

You can find below the complete code for the activity. Using the activity in UiPath:

Which produces the result:

Tip: naming the groups in the regular expression lets us use the same names as column names in the DataTable, which makes it easy to address the specific groups we extracted later on.
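As an illustration of the tip, the sketch below reads values by column name. It assumes a table whose columns are called area, url and descript, like the named groups in the expression above; the table here is built by hand with made-up values, and NamedColumnDemo and FindUrl are hypothetical names.

```csharp
using System;
using System.Data;

public static class NamedColumnDemo
{
    // Hand-built stand-in for the activity output (made-up values).
    public static DataTable BuildSample()
    {
        DataTable table = new DataTable();
        table.Columns.Add("area", typeof(string));
        table.Columns.Add("url", typeof(string));
        table.Columns.Add("descript", typeof(string));
        table.Rows.Add("Emails", "mailto:jane@example.com", "jane@example.com");
        table.Rows.Add("Twitter", "https://www.twitter.com/jane", "jane");
        return table;
    }

    // Looks up the url of the first row whose area column equals the given value.
    public static string FindUrl(DataTable table, string area)
    {
        // Single quotes inside the value are doubled for the filter expression.
        DataRow[] hits = table.Select("area = '" + area.Replace("'", "''") + "'");
        return hits.Length > 0 ? (string)hits[0]["url"] : null;
    }

    public static void Main()
    {
        DataTable table = BuildSample();
        // Address cells by group name instead of by position.
        foreach (DataRow row in table.Rows)
        {
            Console.WriteLine("{0}: {1}", row["area"], row["descript"]);
        }
        Console.WriteLine(FindUrl(table, "Twitter"));
    }
}
```

Because the lookup goes through the column name, it keeps working if further groups are added to the expression and the column positions shift.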
