One of the most interesting new technologies that went viral in 2023 is ChatGPT – the generative AI large language model that retains context across conversations, creating an interactive experience that is uncannily similar to conversing with a human. You probably have experimented with it or may even use it on a regular basis.
I became interested in ChatGPT and decided to conduct several experiments using it in a NonStop context.
Since I work mainly with NonStop databases, I focused my attention on addressing some of the issues common to creating and loading SQL databases. I found the results to be very interesting, encouraging, and useful to the point of justifying further exploration. My hope is that this brief introduction will spur further interest in using this and other AI technologies with NonStop products.
I often need to mock up database environments and quickly prototype ideas, which involves generating and loading data. Creating a database is usually the easy part. Generating and loading representative data is more challenging, especially for a large volume of data. Although many tools are now available to load a database, writing program code to generate data is, well, mere drudgery, especially if only required occasionally.
Yet, this is the area that I wanted to explore with ChatGPT. If it could handle these tasks reasonably well, then perhaps it could be used effectively in other areas.
I have previously relied on the SQL/MX TRANSPOSE capabilities for generating data for SQL tables, and I have written and spoken about this already. It is very simple to use and effective for quickly generating a large volume of data. One challenge, however, is to generate representative data. It’s easy enough to match the proper datatypes, but beyond that, its effectiveness is quite limited. I was most curious to see if ChatGPT could do a better job.
My test scenarios centered on creating a single database table and exploring various ways to use ChatGPT to generate and load data.
Working with ChatGPT
To expect good results from interactions with ChatGPT it is imperative to use effective prompts – these contain the context and questions that help focus ChatGPT on the issue to be addressed. Prompts are a bit of an art form and require some experimentation. If you don’t get the expected results, try modifying your prompts.
For example, when I started my session, I used prompts such as the following:
“Assume you are an experienced DBA and work with the NonStop SQL/MX database.”
“For the following requests, base your answers on the SQL/MX database and its syntax.”
The following sections describe the prompts used and their effects in the progression to generate and load test data efficiently for a SQL table.
Creating the SQL table
I prompted ChatGPT as follows:
“Create a table that represents patient records.”
It did a good job of generating typical columns for such a table. I wanted a few more columns, so I simply requested it do so:
Regenerate the table DDL with a column, Medical History, with the data type varchar (1024).”
“Regenerate the table DDL with two new columns, Admission Date and Discharge Date.”
I was satisfied with the results — they were correct and were generated in less time than manually creating the table.
Generating and loading data
Next, it was time to generate some test data.
With the prompt:
“Generate 10 rows of sample data for the table”,
ChatGPT generated data appropriate for the datatypes. Impressively, it generated representative-looking data – names were real names and of the proper type – first and last names, street addresses and names of cities and states, realistic descriptions, etc. Dates were also representative.
ChatGPT generated a single insert statement with multiple rows, which works fine for small samples but will ultimately fail when the statement length is exceeded. Separate insert statements were needed.
So, a series of interactions began to refine the method for loading data into the table. Tasks like these are best handled incrementally, making one change at a time and continuing accordingly based on the results.
First was to use a separate insert statement for each row generated, with the prompt:
“Make each row a separate insert statement.”
An impressive characteristic of ChatGPT is its ability to reference previous context indirectly – in that sense, it is like a person remembering what you said previously. This makes interactions with ChatGPT very natural.
Next, I wanted this to be a shell script. With the prompt,
“Create a shell script to generate the same results”,
ChatGPT used a bash script to generate data for each column, encompassed that within the proper syntax for the insert statement, and sent the generated results to mxci for processing. All I had to do was to copy that script into an OSS file for execution.
So far, so good for a rather simple experiment.
To generate more rows, the script needed to iterate through a loop to generate values for the insert statements rather than list them serially, inline.
With the following prompts, I was able to produce the desired results:
“Create a script that will generate the row values within a loop instead of using 10 separate lines.”
“Regenerate the script to insert a single row each time within the loop.”
Now, the data values and insert statements were generated by iterating through the loop, passing each insert statement to mxci, which is more scalable than the previous method.
But what about larger volumes of data? How could that be handled?
There are multiple ways to load data into a table. One way is through batch loading. A simple method is to prepare a data file with SQL INSERT statements for each row. This file can then be OBEYed from mxci. It is fast and efficient. That is the next method I would try.
Using a few prompts, I instructed ChatGPT to regenerate the script but direct its output to an OSS file instead of to mxci:
“Instead of piping the output of the script to mxci, simply write the insert statements to an output file.”
“Regenerate the script to produce 100000 rows.”
I was able to obey that file from mxci, with a significant speed improvement.
While that method works for relatively small batches of data, it eventually becomes a problem because the entire batch of data is processed within a single TMF transaction. At some point, something will fail – the audit trail may become full, or the transaction time may be exceeded. It all depends on the volume of data being processed, but it is not a scalable method to use, and it can impact other application workloads. And if the transaction fails, the rollback time may also become significant.
A few more prompts to ChatGPT solved the problem. I instructed ChatGPT to regenerate the script to process batches of data and to use “begin work” and “commit work” statements for each batch:
“Regenerate the script to include a "begin work;" statement at the first of each group of 1000 rows and include a "commit work;" statement at the end of each group of 1000 rows.”
That solved the single-batch problem and made the process scalable. ChatGPT even defined global parameters to set the number of rows to generate as well as the batch size without direction from me.
While that method was very effective, it did require two steps – first, generate a file containing the SQL INSERT statements, and then separately and manually, I had to obey that file from mxci.
Could I simplify the approach further by having ChatGPT simply write a program to generate the data, connect to the database, and insert the rows directly?
I chose a Java/JDBC program. I could have used C/C++ or COBOL, or even Python, but I preferred a Java program running on the NonStop system using the JDBC Type-2 driver for its popularity and convenience.
I requested ChatGPT generate a JDBC program from the bash script it had generated. I first instructed ChatGPT to assume it was an experienced JDBC programmer that worked with the NonStop SQL/MX database:
“Assume you are a DBA and also a Java JDBC programmer.”
“You are working with the NonStop SQL/MX database.”
“Convert the script that follows into a JDBC program that inserts the output directly into the patient table.”
I then pasted the previous bash script into the ChatGPT session.
Without further prompting on my part, the resulting program contained the necessary database connection statements, the iterative steps to prepare and process rows in batches, the insert statement for the table, and the transaction management statements. The program was scalable in the sense that it could be used to generate 100 rows or 10 million rows without failing due to SQL statement length or transaction limitations. Global parameters were used to set the row count and transaction size to control the volume of output. It automatically used prepared statements and JDBC batch inserts for efficiency.
Separate methods were created to assign values to columns from an array of generated values. This modularity makes changes more easily managed.
Only two things were missing: the specific database URL to use for SQL/MX (ChatGPT provided a generic URL that I replaced) and the absence of the few statements that load the SQL/MX driver, which I simply added to the program. Otherwise, the program was compiled and executed successfully.
I found this to be quite impressive. I could not have written this program from scratch so quickly and easily, not even by using a template program. Nor would I have written it so correctly the first time through. Undoubtedly, I would have been perplexed and delayed by debugging programming errors. An experienced JDBC programmer could certainly have written the program faster, but probably not faster than my interactions with ChatGPT. Based only on the previous bash script as an example, ChatGPT generated an efficient, complete working program within a few seconds.
The experiences I have described here are certainly impressive, but I omitted a few details for brevity.
Working with ChatGPT, while very effective, takes forethought and flexibility to deal with unexpected results, which it sometimes produces.
While ChatGPT generated the DDL for the table correctly, it does not do well with the physical attributes for things such as data location, block size, partition definitions, etc. This is likely due to the limited availability of SQL/MX examples used in its training. This is a common problem for domain-specific subject matter. For the time being, these deficiencies will need to be addressed in other ways.
This example also shows the limitations of ChatGPT. While it can produce very good results for well-known subject areas – such as common programming languages, script, etc. – it will not be effective for more obscure topics, such as understanding SQL physical attributes, MXCS commands, and other less popular products and APIs. I would expect ChatGPT to be effective for programming tasks but require manual intervention for NonStop-specific commands and APIs, such as specifying the details of a PathSend statement, for example.
As mentioned at the beginning, constructing meaningful prompts is important for obtaining effective results. Even so, sometimes the results are not what you want, and you need to work creatively with ChatGPT, nudging it with clues and suggestions and guiding it to the results you want.
As you make progress, it is helpful to periodically save the results (along with the set of prompts used) in case you need to start over. Sometimes, your prompts will lead to a path that produces increasingly undesirable results, and the best course of action is to start over. You can then tell ChatGPT to start again, using the script (or program, or other input) that follows, then paste in the material and continue with a new prompt strategy.
Work incrementally and make one change at a time. This approach makes it simpler for ChatGPT to understand your directions.
New technology, such as ChatGPT, is radically changing how we perform work, and we should expect profound changes to continue for the foreseeable future.
The experiments I described here demonstrate some ways in which ChatGPT can be used to simplify certain tasks. It handles common programming tasks reasonably well but struggles with more obscure topics, as explained in the above details. For database-oriented tasks such as generating test data, it shows much potential. Similarly, using it as a programming assistant for generating common code produced effective results. It could certainly be used to generate template programs that are finalized manually. For serious work, its output needs to be reviewed and verified.
Much has already changed in the year-plus since the advent of ChatGPT. I conducted my tests using the GPT 3.5 model. The newer GPT 4.0 model is now available (for a fee), which may produce better results. There are also several other large language models available (some open-source), and other products designed specifically to generate program code, convert source code to other languages (for modernization, for example), and deal with the problems of obscure domain-specific subject matter. These are all worthy of additional investigation.
My purpose for writing this article is to demonstrate some simple yet effective ways that large language models, such as ChatGPT, can be used to simplify certain tasks and improve productivity. I also hope it will generate interest in exploring this subject area more on your own.