Warren F bio photo

Warren F

Systems Engineer with a penchant for PowerShell, science, cooking, information security, family, cookies, and the Oxford comma.

Connect

@pscookiemonster LinkedIn Github Stackoverflow TechNet RSS Feed My old blog

My GitHub Repos

AppVReporting BuildHelpers Citrix.NetScaler Git-Presentation InfoBlox Invoke-Parallel PowerShell PSDepend PSDeploy PSDiskPart PSExcel PSHTMLTable PSRabbitMQ PSSlack PSSQLite PSStash RabbitMqTools SecretServer

Overview

Rambling

We occasionally get requests we can’t refuse. Do I have any interest in working with spreadsheets? Not particularly. But if my boss, or my bosses boss asks me to merge two spreadsheets based on a common column, I don’t want to say “sorry, no can do.”

I also don’t want to get stuck being the spreadsheet guy. So I wrote a quick PowerShell function that merges two sets of data, found a more flexible but slow alternative from Lucio Silveira, and settled on extending a nice modification from Dave Wyatt.

The result is yet another Join-Object.

Overview

Let’s look at a silly concocted example. I have a spreadsheet of managers, and a spreadsheet of departments. For some inexplicable reason, I need to match up departments to their manager’s birthday.

Managers

Departments

If you’ve spent any time with T-SQL, you’ve probably seen the handy Venn diagrams (source) out there, and the discussions on why Venn diagrams don’t match join statements alone.

Let’s take a look at how we might join the manager and department data.

Left joins

A left join is fairly self explanatory - we include all rows from the left data set (managers), and if anything matched on the right side (departments), we include that data:

Left Join

Inner joins

An inner join is the intersection of both data sets. We return only data where a row on the left matched up with a row on the right:

Inner Join

Full joins

In some cases you want all the data, regardless of whether it is in either set:

Full Join

Let’s use PowerShell to join this data up!

Join-Object

Join-Object can be found in my hodge podge repository of PowerShell functions. Here’s a quick rundown on the parameters:

  • Left - a collection of objects for the left side - you can pipe in data for this
  • Right - a collection of objects for the right side
  • LeftJoinProperty - The property on the Left collection whose value must match RightJoinProperty in the Right collection
  • RightJoinProperty - The property on the Right collection whose value must match LeftJoinProperty in the Left collection
  • LeftProperties (optional) - If specified, limit properties we keep from the left to this set
  • RightProperties (optional) - If specified, limit properties we keep from the right to this set
  • Prefix and Suffix (optional) - If specified, add a prefix or suffix to all Right collection property names. Quickly avoid collisions.
  • Type (optional) - What type of join:
    • AllInLeft - Left join
    • AllInRight - Right join
    • OnlyInBoth - Inner join
    • AllInBoth - Full join

That was painful! Let’s look at this in practice, where it’s a bit easier to see what’s going on.

Demo data

I’ll use PSExcel to pull in data from a spreadsheet. Keep in mind there are other options, like the fantastic ImportExcel module from Doug Finke, which PSExcel borrowed from.

$L = Import-XLSX -Path C:\temp\JoinTest.xlsx -Sheet 1

Managers in PS

$R = Import-XLSX -Path C:\temp\JoinTest.xlsx -Sheet 2

Departments in PS

We have the data, how do we join it together?

Inner join

Join-Object -Left $L -Right $R -LeftJoinProperty Name -RightJoinProperty Manager -Type OnlyIfInBoth -RightProperties Department

We can tell Join-Object that we only want rows where $L.Name is equal to $R.Manager (inner join), and to only return the Department property from the $R collection

Inner Join

Left join

Here’s where things get more interesting. Let’s run Lucio’s code first:

Missing Department

What happened! We only got back a name and birthday, where is the department that I wanted?

It turns out that PowerShell will see the first object in the pipeline, which only has a name and birthday property, and display output with only those two properties.

We could manually select Name, Birthday, and Department, or use Format-List to see the other properties, but I like abstraction, so in the extended Join-Object, we select the full set of properties for output.

Join-Object -Left $L -Right $R -LeftJoinProperty Name -RightJoinProperty Manager -Type AllInLeft -RightProperties Department

This time, we get the expected output, without worrying about missing columns:

Left Join

Full join

There’s another issue you can run into. What if you have two properties with the same name, with differing values and meaning? In our example, we have a set of managers, where name refers to the manager’s name. We also have a name property on the department set, that refers to the department name.

Let’s look at the practical implication if we don’t account for this:

Full Join, Overwrite

Interesting, the left values for name were overwritten by the right values. How can we fix this? The simplest solution is to add a prefix or suffix to all properties from the right collection:

Join-Object -Left $L -Right $R -LeftJoinProperty Name -RightJoinProperty Manager -Type AllInBoth -Prefix r_

Now we get all the properties, and nothing is overwritten:

Full Join, prefix

Maybe you are more familiar with calculated properties. The RightProperties parameter can take individual properties, calculated properties, or a mix:

Join-Object -Left $L -Right $R -LeftJoinProperty Name -RightJoinProperty Manager -Type AllInBoth -RightProperties @{ N = "DeptName"; expression = {$_.Name} }

Rather than a generic prefix or suffix, we can use calculated properties to rename these conflicts, or manipulate their values:

Full Join, calculated property

Let’s take a peak at performance!

Performance

Lucio’s script offers some pretty cool flexibility, allowing you to specify a custom ‘Where’ scriptblock rather than assuming we want data where one value is equal to another. Unfortunately, Add-Member and invoking those scriptblocks takes a bit more time.

As you can see, this flexibility comes at a pretty steep cost, even when comparing two relatively small data sets:

Version Seconds
Lucio’s 145.0
Dave’s ~1.0
Warren’s ~1.2

Practical example: Active Directory Input

The request:

“Hey Warren, we need to match up SSNs to Active Directory users, and check if they are enabled or not. I’ll e-mail you an unencrypted CSV with all the SSNs from gmail, what could go wrong?”

The code:

# Import some SSNs. 
$SSNs = Import-CSV -Path D:\SSNs.csv

#Get AD users, and match up by a common value, samaccountname in this case:
Get-ADUser -Filter "samaccountname -like 'wframe*'" |
    Join-Object -LeftJoinProperty samaccountname -Right $SSNs `
                -RightJoinProperty samaccountname -RightProperties ssn `
                -LeftProperties samaccountname, enabled, objectclass

The result:

Join-Worksheet

Join-Worksheet

It’s a bit simpler to just use Join-Object, but you can find a crude Join-Worksheet function in PSExcel:

Join-Worksheet

Closing

That’s about it! If you like this sort of thing and haven’t worked with SQL yet, read up on T-SQL. You can use simple PowerShell functions like Invoke-Sqlcmd2 or the PSSQLite module to work with MSSQL and SQLite, respectively, or use the many other tools out there.

If you have any suggestions or run into any issues, a pull request would be welcome : )

Disclaimer: title image source