Thursday, July 16, 2020

Testing table to JSON conversion - importing JSON in blob back to SQL

When it comes to conversions, one of the best ways to verify them is to convert back and check that you get the data you expect. Importing JSON into MSSQL from a local disk has become trivial, and the MS documentation describes it well. But in practice there is limited use for loading data from the server's local disk, so I decided to upload it to an Azure blob container instead.
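For context, here is a minimal sketch of the two halves mentioned above - the export that produces the JSON in the first place, and the "trivial" local-disk import. The Orders table and the file path are placeholders; the column list matches the Northwind-style table used later in the post.

-- Export: turn the table into a JSON array (roughly how orders.json could be produced)
SELECT OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate,
ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion,
ShipPostalCode, ShipCountry
FROM Orders
FOR JSON PATH;

-- Import from local disk: read the whole file as a single value
SELECT BulkColumn
FROM OPENROWSET (BULK 'C:\data\orders.json', SINGLE_CLOB) AS j;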

Loading from blob storage was a challenge, mainly because different documents describe parts of it in different ways. Some have changed over time, and others cover requirements for different purposes. Once you upload the data to a blob container, this is all you need to do:
  1. Create a SAS (Shared Access Signature) in Azure
  2. Create a credential in SQL using the SAS
  3. Create an external data source using the credential
  4. Import it into a new table using OPENROWSET (BULK) and OPENJSON
These are some points to remember - I spent a long time figuring some of them out:
  1. Create a SAS from Azure - this is not the storage key, but a token created using the key. You can time-limit tokens and put some other restrictions on them too, which is handy.
  2. When copying the SAS, omit the leading question mark. This is documented everywhere, but I don't know why it is there in the first place - probably for easy construction of REST URLs...
  3. You have to create a DATABASE SCOPED credential. If you don't, you get a cryptic error message saying the credential doesn't exist or you don't have permission.
  4. You might have to create a master key, unless one was already created for something else (a minimal example follows this list).
  5. The identity of the credential should be SHARED ACCESS SIGNATURE.
  6. You can use any (valid) name for the credential. Some documents ask you to use the storage location as the name, but that is for a different process, not for loading from blob.
  7. When looking for the file, SQL Server combines the location in the data source with the path and file given in the load statement. So you can use the storage root as the location and pass the file as 'container/file', or put the container in the location itself and give only the file name when loading (a variant showing this follows the full example below).
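For point 4, creating the master key is a one-liner; the password below is just a placeholder, and the follow-up query checks whether one already exists:

-- Needed once per database, before the database scoped credential can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ngPl@ceholderPassw0rd!';

-- Check whether a database master key is already there
SELECT name FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##';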
Putting it all together: if the JSON file in blob storage is called orders.json and you import it into a SQL table called orderJson, the SQL looks like this:

-- Store the SAS token (without the leading question mark) as a database scoped credential
CREATE DATABASE SCOPED CREDENTIAL MyContainerKey
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=xxxxxxxxxxxxxxxxxxx';


-- Point an external data source at the storage account, using the credential above
CREATE EXTERNAL DATA SOURCE blobdata
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://myblobstorage.blob.core.windows.net',
CREDENTIAL = MyContainerKey );


-- Read the whole file as a single value (SINGLE_CLOB), shred it with OPENJSON,
-- and load only the shredded columns into a new table
SELECT orderdata.*
INTO orderJson
FROM OPENROWSET ( BULK 'mycontainer/orders.json', DATA_SOURCE = 'blobdata', SINGLE_CLOB ) AS j
CROSS APPLY OPENJSON(BulkColumn) WITH (
[OrderID] [int],
[CustomerID] [nchar](5),
[EmployeeID] [int],
[OrderDate] varchar(30),
[RequiredDate] varchar(30),
[ShippedDate] varchar(30),
[ShipVia] [int],
[Freight] [money],
[ShipName] [nvarchar](40),
[ShipAddress] [nvarchar](60),
[ShipCity] [nvarchar](15),
[ShipRegion] [nvarchar](15),
[ShipPostalCode] [nvarchar](10),
[ShipCountry] [nvarchar](15)
) AS orderdata;
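
As noted in point 7, the same import works with the container included in the data source location; only the file name is then given in the load statement. This is just a variant of the example above (the data source name blobdatacontainer is made up):

-- Same storage account, but with the container baked into the location
CREATE EXTERNAL DATA SOURCE blobdatacontainer
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://myblobstorage.blob.core.windows.net/mycontainer',
CREDENTIAL = MyContainerKey );

-- Now only the file name is needed in the BULK path
SELECT BulkColumn
FROM OPENROWSET ( BULK 'orders.json', DATA_SOURCE = 'blobdatacontainer', SINGLE_CLOB ) AS j;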

This should be relatively fast; if it takes minutes, the usual reason is that the URL in the location is wrong.
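Two quick checks that helped me: the catalog views below show what the credential and data source were actually created with (handy when the import hangs because the location URL is wrong), and a simple EXCEPT query closes the round trip - it assumes the original Orders table is available in the same database, and compares only a few key columns since the dates came back as strings:

-- What did we actually create?
SELECT name, location, credential_id FROM sys.external_data_sources;
SELECT name, credential_identity FROM sys.database_scoped_credentials;

-- Round-trip check: rows in the original table that did not come back identically
SELECT OrderID, CustomerID, EmployeeID, ShipCountry FROM Orders
EXCEPT
SELECT OrderID, CustomerID, EmployeeID, ShipCountry FROM orderJson;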

