HowTo – Dealing With Large Databases And Modifying Them For Use

Filed under: Site Management

There are times when you need to work with a large database on a remote host where you do not have control over Apache, PHP and MySQL to alter timeouts and file sizes. This is often the case on shared hosting plans, and it affects most websites that grow large or that offer a lot of data in a single database.

Recently I was working with a large set of data, with 100+ megabyte data files that needed to be merged and reconstructed into tables that would eventually store the backend of a website. Although the original files were very large and came close to a gig in total space, the eventual database would be about 10% of that size, or about 90 MB in total.

This may not sound like a lot of data, but when you are dealing with the constraints of shared hosting it can be difficult.

Working with the data locally was not so dramatic. After deciding which tables I wanted to use and which new tables had to be generated, the worst part was just waiting for things to happen. After arranging the data in my MySQL database I then had to use a script to import the data into a WordPress format. This took the longest amount of time because each row had to go through the process of generating a new post.

The first try of my PHP script had me running out of memory … this was not a problem, I just set PHP's memory limit in the php.ini file to use all available memory.
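If you cannot easily edit php.ini, the same limit can usually be raised from inside the script itself. A minimal sketch (the value -1, meaning no limit, is only sensible on a local machine):

<?php
// Raise the PHP memory ceiling for a long-running import.
// Equivalent php.ini directive:  memory_limit = -1
// -1 removes the limit entirely -- only do this on a local machine.
ini_set('memory_limit', '-1');

// Or set a generous fixed cap instead of removing it:
// ini_set('memory_limit', '1024M');

echo ini_get('memory_limit') . "\n"; // confirm the new value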

The second problem I ran into was timeouts. The script had to execute for hours .. yes hours .. so I had to alter the timeouts in both PHP and Apache to extend how long processes could run, and I executed the script overnight. It actually got to the point where the execution time was heading towards 10 hours.
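For reference, here is roughly what those timeout changes look like; this is a sketch assuming a local setup where you control both PHP and Apache:

<?php
// Remove PHP's script execution limit for this run.
// Equivalent php.ini directive:  max_execution_time = 0
set_time_limit(0);

// If the script runs through a browser, Apache's Timeout directive
// in httpd.conf may also need raising, for example:
//   Timeout 3600
// Running the job from the command line (php import.php) avoids
// the web server timeout entirely.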

Finally, after all of the data was prepared locally, I needed to upload it to a host that wasn't going to be happy about the amount of time it was taking to insert the data into my new database. Again I chose to perform the work at night, during a time when I knew my host's server was under less load.. this did help to some extent because I wasn't fighting for resources, however … the internet never sleeps.

Heads Up

So, what clues can I give you when dealing with a lot of data that needs to be modified?

1) The first suggestion is that you prepare all your work locally.. either on your desktop or on a dedicated computer you use for development and mirroring of your site. If you don't mirror or develop locally then you had better start. Get XAMPP or another distribution of Apache, PHP, MySQL and Perl and learn how to adjust its settings.

2) Plan your attack. If you are dealing with someone else's raw database then there is probably a lot of data that you need to sort by keys, and much of it you won't even use. The first thing you need to do is get a layout of your database. If you can import it locally you can use an app to review the tables. If the database is so large that importing it is not easy, then strip off the header of a CSV file to find the layout, or find an app that can give you an .xsd file from an XML export, as in the sketch below.
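As a small illustration of the strip-the-header idea, here is a sketch that reads just the first line of a huge CSV to list its columns without loading the whole file (the file name is only a placeholder):

<?php
// Peek at the column layout of a huge CSV without loading the whole file.
// 'raw_export.csv' is just a placeholder name.
$handle = fopen('raw_export.csv', 'r');
if ($handle === false) {
    die("Could not open file\n");
}
$header = fgetcsv($handle); // reads only the first row
fclose($handle);

print_r($header); // the column names

// If the data is already in MySQL, the same information comes from:
//   SHOW TABLES;
//   DESCRIBE table_name;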

3) Reduce and prepare your files. Once you know the layout of the raw databases you are working with, you can select which data you want to use. If you can import it into MySQL then you can drop and edit tables and delete columns that you don't need.

Export your working files and then re-import them based on keys to merge the many databases into tables that you can use for your project.

Once you have excluded the garbage and arranged the data you want, you have a place to begin building your final database.
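As a rough sketch of that reduce-and-merge step, here is the kind of SQL you might run from a small PHP script; every table and column name below is hypothetical and stands in for your own keys:

<?php
// Hypothetical example: drop an unused column, then merge two raw
// tables on a shared key into a new working table.
$db = new mysqli('localhost', 'root', '', 'working_db');

// Drop a column we decided not to keep.
$db->query("ALTER TABLE raw_products DROP COLUMN legacy_notes");

// Build a merged table keyed on product_id from two raw imports.
$db->query("
    CREATE TABLE merged_products AS
    SELECT p.product_id, p.name, p.description, v.price, v.stock
    FROM raw_products AS p
    JOIN raw_vendor_data AS v ON v.product_id = p.product_id
");

echo $db->error ? $db->error . "\n" : "Merge complete\n";
$db->close();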

4) Test your import scripts. Before you go through the process of pushing hundreds of thousands of fields around, you want to set up a sample database based on real data. Export the first few rows, maybe 1,000 rows, of each of the tables you plan to work with and pump them through your PHP script.

If you have 1 million rows and you are testing with 1,000 rows or 50,000 rows, then you can calculate the execution time necessary to run your final import. IMPORTANT: scripts often slow down over long execution times, probably due to the way memory is handled and not released, or to CPU load. This means you may think you have calculated an exact time of completion, but it may end up taking far longer.
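One way to put numbers on that is to time a sample batch and extrapolate. A sketch, where import_row() is a placeholder for whatever your real script does per row:

<?php
// Time a 1,000-row sample and extrapolate to the full import.
// import_row() stands in for your real per-row processing.
$sample_size = 1000;
$total_rows  = 1000000;

$start = microtime(true);
for ($i = 0; $i < $sample_size; $i++) {
    import_row($i); // your real per-row work goes here
}
$elapsed = microtime(true) - $start;

$per_row  = $elapsed / $sample_size;
$estimate = $per_row * $total_rows;

printf("Sample: %.2f s, estimated full run: %.1f hours\n",
       $elapsed, $estimate / 3600);
// Long runs often slow down as memory and tables grow, so treat
// this estimate as a lower bound, not a schedule.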

5) Learn how to recover. Unless you work with large databases continuously over a long period of time, something is going to happen to cause a problem. It may be something as simple as a corrupted file or mismatched character sets that won't show up until you are 70% through your execution.. you may even find that your layout is not correct and you need to backtrack.

Make backups often.. sure, you can copy tables or even databases within MySQL, but also make exports and label them well, or use an index.txt to know exactly where you were when you made each export.
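A sketch of that kind of checkpoint; the table name and the note written to index.txt are placeholders you would adapt to your own workflow:

<?php
// Checkpoint sketch: copy a working table inside MySQL and record it
// in index.txt so you know which step the copy belongs to.
$db    = new mysqli('localhost', 'root', '', 'working_db');
$stamp = date('Ymd_His');

$db->query("CREATE TABLE merged_products_bak_$stamp AS SELECT * FROM merged_products");

file_put_contents(
    'index.txt',
    "$stamp  merged_products copied after de-duplication pass\n",
    FILE_APPEND
);
$db->close();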

What about Imports and Exports?

Once you have completed all of your work you will need to get it to your webhost. Because most hosting plans won't allow you to burst for more than a few seconds or minutes, your database imports have to stay within those constraints.

The first thing you want to do is make sure everything works locally; then you can examine your tables and make incremental exports and imports to your host.

If you are dealing with many tables, or even columns within tables, you can export just that data in manageable sizes and import the pieces separately. If you are working with a single table you can also export only a percentage of it at a time to keep file sizes within your host's limits, as in the sketch below.
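One way to make those incremental exports is to dump a table in fixed-size chunks so each file stays small. A sketch, with placeholder names and a chunk size you would tune to your host:

<?php
// Export a large table in chunks so each file stays within the
// host's upload and import limits. Names and sizes are examples.
$db         = new mysqli('localhost', 'root', '', 'working_db');
$chunk_size = 5000;
$offset     = 0;
$part       = 1;

while (true) {
    $result = $db->query(
        "SELECT * FROM merged_products ORDER BY product_id LIMIT $chunk_size OFFSET $offset"
    );
    if ($result === false || $result->num_rows === 0) {
        break; // nothing left to export
    }

    $out = fopen(sprintf('export_part_%03d.csv', $part), 'w');
    while ($row = $result->fetch_assoc()) {
        fputcsv($out, $row);
    }
    fclose($out);

    $offset += $chunk_size;
    $part++;
}
$db->close();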

It will take a little bit of trial and error to get your data prepared for upload.

What about backups?

Once your site is live, most hosts will have an option to back up your database on their server and then download it as a file. cPanel will do this live for you from the backup widget, or you can do it with mysqldump on the command line. You will have to learn how to do this based on your host's features.
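If your host allows cron jobs or shell access, a sketch of wrapping that command-line backup in a small PHP script looks like this; the credentials, database name and paths are all placeholders:

<?php
// Nightly backup sketch using mysqldump from a cron job.
// Credentials, database name and paths are placeholders.
$stamp = date('Ymd');
$file  = "/home/youruser/backups/site_db_$stamp.sql.gz";

// Assumes mysqldump and gzip are available on the host.
$cmd = "mysqldump -u db_user -p'db_pass' site_db | gzip > " . escapeshellarg($file);
shell_exec($cmd);

echo "Backup written to $file\n";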

It will be important to keep your local mirror fresh, so when you download your database, import it into your local MySQL database.

Final Note

As you can see, we didn't cover all of the different applications available for database manipulation, or what settings should be used to achieve different results. Because of the many formats you are likely to run into, manipulating databases requires that you collect tools and test them for your own results.

I can suggest that if you are on Windows you first install XAMPP or another package that will give you access to all the tools you use on your host.

You will probably need to modify the packages right out of the box, and at a minimum you want to find your memory and timeout settings.

After you have established your database server you can use tools such as phpMyAdmin, MySQL Administrator and MySQL Workbench for Windows (available at the MySQL site), Microsoft XML Notepad, and Komodo Edit for raw file editing.

I understand many people also like Navicat, which comes in different versions.

I even use OpenOffice Calc quite a lot, not only to lay out tables but to manage data. It is limiting on large databases because of its row limits, but it is a good tool to have.

ANYWAY the basic idea is to know what is in the databases you have… know how you want to use that data and plan its layout… then understand your limitations and work around them.. and make lots of incremental backups as you work..

I hope this is a helpful overview … good luck .. you will need it.
