Troubleshooting OEM Install Problems

Article by author Chris Foote

 

OEM Troubleshooting Overview

Like previous versions of Enterprise Manager, OEM 10G Grid Control is a multi-tier architecture. It consists of the HTML console, an OEM management service with an integrated information repository, and OEM management agents running on all monitored servers.

The OEM management service receives monitoring data from the various agents and loads it into the management repository. The OEM management console retrieves data from the management repository, organizes it and then displays it as information to the administrator via the HTML console interface.

The agents are programs that run continuously on all servers that are controlled by the Enterprise Manager architecture. Examples of the more popular targets found on the servers are databases, application servers and listeners.

Agent Installation and Configuration Differences

In Oracle OEM, the agent software is installed separately from the database and application server. During the agent software installation, the installer prompts the user for the node name of the server that is running the 10G management service. The target agent then contacts the management server and uploads its configuration information. This is a complete reversal of the way the process worked in previous releases.

OEM Agent Troubleshooting

But what if we lose communications between the target agents and the management service? Although we are by no means experts, we have done some troubleshooting from time to time.

If alerts and monitoring information aren't being received by the management service, only a few components can be causing the communication breakdown. The problem could be on the management server itself. We'll take a look at troubleshooting that environment in an upcoming blog. The problem may be in the network connectivity, or lack thereof. If you don't have network connectivity, you probably have more problems than just being unable to administer and monitor the target with 10G EM. More than likely you will also be getting calls from irate users letting you know that they can't access the databases on that server. In that case, do what DBAs have been doing for years: blame the network administration boys.

The remainder of this blog will give you a few helpful hints on troubleshooting the target agent software. Let's take a look at our Agent Administration panel in 10G Grid Control. The agent administration screen lists all of the agents currently active in the 10G Grid Control environment. Each line displays the agent software version, status (up, down, problem), number of targets that are using it and the number of targets that aren't using the agent. Each agent's name is a link that allows the user to view more detailed configuration information about that agent. If I click on that link, EM will display a panel that allows me to drill down to more specific information about that agent. The Agent Drill Down panel provides detailed information on the agent's configuration, status, resource utilization, targets monitored and upload information.

The most important piece of information on the agent administration panel is the column titled 'Last Successful Load'.

The next section of the agent administration panel provides information on metric collection errors. Each line contains information pertaining to the agent and when the collection error occurred. If Oracle is able to provide additional information, it will display the error message as an HTML navigation link that allows the user to drill down into more specific information. I can either roll my mouse over the navigation link or click on it to display the more detailed error information.

Let's pick one of the errors and go through the error determination process. We'll start with the problem that occurred at the top of our Metric Errors Collection report. 10G Grid Control was unable to run the Agent Process Statistics process on Sept 3 at 6:46:27 AM.

The first step that needs to be executed when performing agent error determination is to log on to the server where the agent is running, navigate to the agent's home directory and run a status command on the agent. The following command displays pertinent information about the status and configuration of the agent running on the target:

> emctl status agent

If you look at the output you will see that we have loaded 606.39 MEGs of XML data to the management server and we have 0 files and 0 MEGs of data pending upload. These numbers are telling us that we have a functioning agent on this platform. If the number of files pending upload increases continuously, you have an agent to management server problem. If you want to see all of the variations of the EMCTL command, type "emctl" at the prompt with no arguments after it. Oracle will display a listing of all the variations of the EMCTL command on the screen.
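The pending-upload check described above can be scripted. The following is only a rough sketch: it assumes the status output contains a line labeled "Number of XML files pending upload" followed by a colon and the count, and the helper name pending_uploads is made up for illustration. If your release words that line differently, adjust the pattern.

```shell
#!/bin/sh
# Sketch: pull the pending-upload file count out of "emctl status agent" output.
# ASSUMPTION: the output has a line such as
#   Number of XML files pending upload : 0
# The exact label may differ between releases.

pending_uploads() {
    # Reads the status output on stdin and prints the number after the last colon.
    awk -F':' '/Number of .*files pending upload/ {
        gsub(/[ \t]/, "", $NF)   # strip blanks around the value
        print $NF
        exit
    }'
}

# Typical use on the target server (left commented so the sketch has no side effects):
# emctl status agent | pending_uploads
```

Running this a few minutes apart and comparing the two numbers is a quick way to see whether the backlog is growing.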

This next screenshot displays the agent's parent directory structures. Notice that I am navigating to the SYSMAN subdirectory. SYSMAN is the parent directory for the subdirectories that contain the diagnostic information that we will need to use as input to solve our problem. We'll be spending the bulk of our time in this directory structure. Although there are numerous files and subdirectories, I will be covering just the files that we have been using to debug our environment. If I don't, I'll have another one of the world's largest blogs.

The CONFIG subdirectory contains files that are used to tailor the agent to its host platform's configuration. The emd.properties file is the main configuration file that we have had to edit from time to time to fix a few of our issues. The management service that the agent communicates with is identified in the REPOSITORY_URL parameter. If you want to change the host name or port of the management server that this agent communicates with, you will have to edit the REPOSITORY_URL parameter to reflect the new information. Later in this blog, I'll show you how to cleanse the old information from the directories to allow the agent to successfully connect to the new management server.
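Before editing emd.properties it helps to see what the agent is pointing at right now. The sketch below is one way to do that, assuming the file uses plain key=value lines; get_prop is a hypothetical helper name, not part of the Oracle tooling.

```shell
#!/bin/sh
# Sketch: print the value of one key from a key=value properties file.
# ASSUMPTION: entries look like KEY=value, one per line, and lines that
# start with "#" are comments. get_prop is an illustrative name only.

get_prop() {
    # $1 = key name, $2 = path to the properties file
    # If the key appears more than once, the last occurrence is printed.
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Typical use from the agent home (left commented so nothing runs on load):
# get_prop REPOSITORY_URL "$ORACLE_HOME/sysman/config/emd.properties"
```

Taking a backup copy of emd.properties before changing REPOSITORY_URL is a cheap safety net.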

There is one other parameter in emd.properties that may cause a few problems. If your agent NEVER successfully uploads data after installation, check the time zone parameter (agentTZRegion) in the emd.properties file. It is usually the last parameter in emd.properties. When our agents' time zones didn't match the time zone in the management server's configuration file, we weren't able to establish a successful connection between the agents and the management service. So if you have agentTZRegion=America/New_York in the agent's emd.properties file, you'll prevent a lot of headaches if you use the same time zone in the management server's emd.properties configuration file. We don't have any servers in different time zones, but there is a wealth of information contained in the 10G Grid Control documentation to help those that do. What I do know is that when the time zones didn't match, we couldn't make a connection.
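A quick way to compare the two settings is to copy the management server's emd.properties next to the agent's and diff the agentTZRegion values. This is a sketch under that assumption; the function names tz_of and same_tz and the example file names are invented for illustration.

```shell
#!/bin/sh
# Sketch: compare agentTZRegion between two emd.properties files.
# ASSUMPTION: both files hold a line of the form agentTZRegion=Region/City.
# tz_of and same_tz are illustrative names, not Oracle utilities.

tz_of() {
    # $1 = path to an emd.properties file; prints the agentTZRegion value
    sed -n 's/^[[:space:]]*agentTZRegion[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

same_tz() {
    # $1 = agent's emd.properties, $2 = management server's emd.properties
    agent_tz=$(tz_of "$1")
    server_tz=$(tz_of "$2")
    if [ -n "$agent_tz" ] && [ "$agent_tz" = "$server_tz" ]; then
        echo "time zones match: $agent_tz"
    else
        echo "time zone MISMATCH: agent='$agent_tz' server='$server_tz'"
        return 1
    fi
}

# Typical use (left commented; the second file is a copy fetched from the server):
# same_tz "$ORACLE_HOME/sysman/config/emd.properties" /tmp/oms_emd.properties
```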

The EMD subdirectory contains a few files and subdirectories that are important to us. The UPLOAD subdirectory is a holding area for files that will be uploaded to the management server. Lastupld.xml is pretty self-explanatory. It contains information on agent uploads to the management server. The file that you may be required to edit from time to time is the targets.xml file. This file contains information about all of the targets (databases, listeners, etc.) controlled by the agent on this platform.

The DBSNMP user is the account that 10G Grid Control uses to log on to the database to perform activities on behalf of the 10G Enterprise Manager toolset. If you change the password for DBSNMP, you will have to edit targets.xml to reflect the new password. To do this, change the value of ENCRYPTED to FALSE and enter the new password as the VALUE in the PASSWORD line for that database. I have also had to change the release identifier after upgrading a database being monitored by 10G Grid Control. If some targets are showing up on a platform while others are not, you may want to review the contents of targets.xml to see if it has entries for the missing targets. Once again, you can use the cleanse script I'll provide you with later in this blog to update targets.xml and upload that information to the management server.
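To check whether targets.xml has an entry for a missing target, a simple search is usually enough. The sketch below assumes each target appears on its own line as a Target element carrying a NAME attribute; that layout is an assumption about the file, and list_targets and has_target are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: look for target entries in targets.xml.
# ASSUMPTION: each target is written as an element that starts with
# "<Target " and carries NAME="..." on the same line. Verify this against
# your own targets.xml before relying on it.

list_targets() {
    # $1 = path to targets.xml; prints every line that opens a Target element
    grep '<Target ' "$1"
}

has_target() {
    # $1 = path to targets.xml, $2 = target name to look for
    # Exit status 0 means an entry with that NAME was found.
    grep -q "<Target .*NAME=\"$2\"" "$1"
}

# Typical use (left commented; "orcl" is a made-up database name):
# list_targets "$ORACLE_HOME/sysman/emd/targets.xml"
# has_target "$ORACLE_HOME/sysman/emd/targets.xml" orcl || echo "orcl is not listed"
```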

The LOG subdirectory will probably be the area where you spend most of your time when you are evaluating a problem with the agent. This subdirectory contains several files that will assist you in the problem determination process. The most important debugging files are:

emagent.nohup - Agent watchdog log file

emagent.log - Main agent log file

emagent.trc - Main agent trace file

emagentfetchlet.log - Log file for Java Fetchlets

emagentfetchlet.trc - Trace file for Java Fetchlets

There is an excellent bulletin on Metalink that will provide you with detailed information on the contents of these logs. The bulletin number is 229624.1. Let's take a look at the contents of emagent.trc to see if we can find the problem. I have scrolled to the times that match the error that was displayed on the top of our error metrics report. The file contains dozens of lines before and after the time of the error stating that we are having an out-of-memory condition. It looks like we have to perform some further error-determination activities to solve the memory problem before we can fix the agent. This out-of-memory condition is most certainly affecting our database connections so I'll need to fix it soon. Lucky for me it is our DBA "playpen" that we use for our own testing (or I would have a bunch of irate users and/or developers clamoring for my head).
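Scrolling through emagent.trc by hand gets old quickly, so a search helps. This is a minimal sketch: the search word "memory" is an assumption about how the out-of-memory message is worded in the trace, and count_matches and show_matches are illustrative names.

```shell
#!/bin/sh
# Sketch: search an agent log or trace file for a pattern, ignoring case.
# ASSUMPTION: the out-of-memory messages contain the word "memory"; check a
# few lines of your own emagent.trc and adjust the pattern if they do not.

count_matches() {
    # $1 = pattern, $2 = file; prints how many lines match
    grep -i -c -- "$1" "$2"
}

show_matches() {
    # $1 = pattern, $2 = file; prints matching lines prefixed by line number
    grep -i -n -- "$1" "$2"
}

# Typical use from the agent home (left commented so nothing runs on load):
# count_matches memory "$ORACLE_HOME/sysman/log/emagent.trc"
# show_matches memory "$ORACLE_HOME/sysman/log/emagent.trc" | tail -n 20
```

The line numbers from show_matches make it easy to jump to the surrounding entries and compare their timestamps with the time shown in the metric errors report.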

The Cleansing Script

If you go to Metalink and do a search on the value 'sysman/emd/state/*' or document ID 303105.1, you'll see a series of commands that can be used to "cleanse" the environment, as we like to describe it here at Giant Eagle. Whenever you get a continuous 'status pending' message for a monitored target, change the name of the management server, reinstall the agent software, or remove the agent from the management server and re-add it (and after a host of other activities), Oracle recommends that you perform the following steps at the end:

1. Stop the agent on the target node

emctl stop agent

2. Delete any pending upload files from the agent home

rm -r $ORACLE_HOME/sysman/emd/state/*
rm -r $ORACLE_HOME/sysman/emd/collection/*
rm -r $ORACLE_HOME/sysman/emd/upload/*
rm $ORACLE_HOME/sysman/emd/lastupld.xml
rm $ORACLE_HOME/sysman/emd/agntstmp.txt
rm $ORACLE_HOME/sysman/emd/blackouts.xml

3. Issue an agent clearstate from the agent home

emctl clearstate

4. Start the agent

emctl start agent

5. Force an upload to the OMS

emctl upload

If you change the name or port of the management server, you will need to run these commands on all platforms that are running the 10G agents. Because we have been moving our 10G EM Management Service from one server to another during our testing and implementation, we have created a script that automates the above commands. We have also used the above series of commands as a last resort when all other debugging avenues have failed. The script just seems to fix a lot of agent to server communication problems. Set your ORACLE_HOME to the Oracle agent's home directory and run the script. If you execute it by typing clean_up_oms.sh with no arguments, it will display the current ORACLE_HOME, which is required to be set to the agent's home directory to allow the script to process successfully.
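For readers who want to build something similar, here is a rough sketch of what such a wrapper could look like. It is not the script we run; it simply strings together the commands listed above. The "go" argument, the EMCTL override variable and the messages are all assumptions made for illustration, and rm -rf is used instead of rm -r so that empty directories don't raise errors.

```shell
#!/bin/sh
# clean_up_oms.sh (sketch): wraps the cleansing steps in one function.
# ASSUMPTIONS, not taken from the original script: the "go" argument that
# confirms the run, the optional EMCTL variable that overrides the emctl
# path, and the wording of every message.

clean_up_oms() {
    echo "ORACLE_HOME is currently: ${ORACLE_HOME:-<not set>}"
    if [ "$1" != "go" ]; then
        echo "ORACLE_HOME must be the agent's home directory. Re-run with 'go' to proceed."
        return 1
    fi

    emd="$ORACLE_HOME/sysman/emd"
    # Refuse to run rm against the wrong place if ORACLE_HOME is empty or wrong.
    if [ -z "$ORACLE_HOME" ] || [ ! -d "$emd" ]; then
        echo "No sysman/emd directory under ORACLE_HOME; nothing was changed." >&2
        return 1
    fi
    emctl_cmd=${EMCTL:-$ORACLE_HOME/bin/emctl}

    # 1. Stop the agent on the target node
    "$emctl_cmd" stop agent
    # 2. Delete any pending upload files from the agent home
    rm -rf "$emd"/state/* "$emd"/collection/* "$emd"/upload/*
    rm -f "$emd/lastupld.xml" "$emd/agntstmp.txt" "$emd/blackouts.xml"
    # 3. Issue an agent clearstate from the agent home
    "$emctl_cmd" clearstate
    # 4. Start the agent
    "$emctl_cmd" start agent
    # 5. Force an upload to the OMS
    "$emctl_cmd" upload
}

# To use this as a standalone script, uncomment the next line:
# clean_up_oms "$@"
```

Called with no arguments, the function only reports ORACLE_HOME and stops, which gives you a chance to confirm you are in the agent home before anything is deleted.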

 

   

 Copyright © 1996 -2011 by Burleson Enterprises. All rights reserved.
